[00:00:28] (03CR) 10Yuvipanda: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319767 (owner: 10Yuvipanda) [00:10:34] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [00:12:17] yuvipanda: ^ [00:12:30] whoops [00:12:31] merged [00:12:39] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2770593 (10Pnorman) Some discussions about the equivalent OSM usage policies: https://github.com/openstreetmap/operations/issues/113 [00:12:58] thx [00:13:10] (03PS1) 10Yuvipanda: tools: Add docker-builder class hosts to clush [puppet] - 10https://gerrit.wikimedia.org/r/319769 [00:13:34] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [00:13:42] * mutante tries 'puppet-check' [00:14:14] (03CR) 10Yuvipanda: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319769 (owner: 10Yuvipanda) [00:20:14] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:23:34] PROBLEM - HHVM rendering on mw1221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [00:24:09] (03PS1) 10Dzahn: confluent::kafka::mirror::jmxtrans: key attr is declared more than once [puppet] - 10https://gerrit.wikimedia.org/r/319770 [00:24:34] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 76765 bytes in 0.776 second response time [00:24:46] (03PS2) 10Dzahn: confluent::kafka::mirror::jmxtrans: key attr is declared more than once [puppet] - 10https://gerrit.wikimedia.org/r/319770 [00:41:32] (03CR) 10Catrope: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319761 (https://phabricator.wikimedia.org/T148611) (owner: 10Catrope) [00:44:35] (03PS2) 10Catrope: Disable Flow on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319761 (https://phabricator.wikimedia.org/T148611) [00:44:39] (03CR) 10Catrope: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319761 (https://phabricator.wikimedia.org/T148611) (owner: 10Catrope) [00:44:43] (03CR) 10Catrope: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319761 (https://phabricator.wikimedia.org/T148611) (owner: 10Catrope) [00:45:13] (03Merged) 10jenkins-bot: Disable Flow on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319761 (https://phabricator.wikimedia.org/T148611) (owner: 10Catrope) [00:48:20] !log catrope@tin Synchronized dblists/: Disable Flow on enwiki (T148611) (duration: 01m 04s) [00:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:27] T148611: Plan to disable Flow on Enwiki - https://phabricator.wikimedia.org/T148611 [00:49:17] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [00:50:55] (03PS2) 10Reedy: Elevate password policies for all users on private wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/319598 (https://phabricator.wikimedia.org/T149638) [00:50:57] (03PS2) 10Reedy: Increase password requirements on enwiki for "Abuse filter editors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319000 (https://phabricator.wikimedia.org/T121186) [00:51:27] PROBLEM - HHVM rendering on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [00:52:27] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 76749 bytes in 0.411 second response time [00:53:47] RoanKattouw, :weep: [00:54:17] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:17] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [01:02:15] (03CR) 10Dzahn: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319123 (https://phabricator.wikimedia.org/T143138) (owner: 10Dereckson) [01:02:29] (03CR) 10Filippo Giunchedi: Thanks Riccardo and Alex! Apologies for the under-indentation all over the place heh. (0315 comments) [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [01:03:19] (03PS4) 10Filippo Giunchedi: Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) [01:04:31] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:15:11] PROBLEM - check_disk on bismuth is CRITICAL: DISK CRITICAL - free space: / 5269 MB (9% inode=90%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [01:16:24] (03CR) 10Alex Monk: [C: 031] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319600 (owner: 10Reedy) [01:17:50] !log catrope@terbium Started scap: (no message) [01:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:11] !log catrope@terbium scap failed: IOError [Errno 13] Permission denied: u'/srv/mediawiki-staging/wmf-config/ExtensionMessages-1.29.0-wmf.1.php' (duration: 00m 20s) [01:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:24] RoanKattouw: from terbium? [01:18:40] Oh ugh [01:18:48] I ran scap sync instead of scap pull [01:18:56] ah [01:19:16] That... probably shouldn't have been allowed to get that far [01:19:20] it... probably should stop you before [01:19:21] yea [01:19:32] quick! file a bug! [01:20:11] PROBLEM - check_disk on bismuth is CRITICAL: DISK CRITICAL - free space: / 4268 MB (8% inode=90%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [01:20:15] hows grrrit-wm doing so far? 
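The exchange above — a full `scap` sync started from terbium when `scap pull` was intended, failing 20 seconds in with a permission error on the staging tree — ends with "quick! file a bug!" about stopping the command earlier. A minimal sketch of the kind of pre-flight guard being asked for, assuming a hypothetical check (the host names, staging path, and function are illustrations, not scap's actual code or configuration):

```python
import os
import socket

# Assumed values for illustration only; real deployments read these
# from scap's configuration.
DEPLOY_HOSTS = {"tin", "mira"}
STAGING_DIR = "/srv/mediawiki-staging"

def preflight_check(staging_dir=STAGING_DIR, deploy_hosts=DEPLOY_HOSTS):
    """Return a list of human-readable problems; empty means OK to sync.

    Checks the two things that bit the sync above: running on a
    non-deployment host, and lacking write access to the staging tree.
    """
    problems = []
    host = socket.gethostname().split(".")[0]
    if host not in deploy_hosts:
        problems.append(
            "%s is not a deployment host; did you mean 'scap pull'?" % host
        )
    if not os.access(staging_dir, os.W_OK):
        problems.append("no write access to %s" % staging_dir)
    return problems
```

Failing fast like this turns a 20-second partial run into an immediate refusal with an actionable message.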
[01:20:41] PROBLEM - Apache HTTP on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [01:20:41] PROBLEM - HHVM rendering on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [01:21:41] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.043 second response time [01:21:41] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 76749 bytes in 0.167 second response time [01:25:11] PROBLEM - check_disk on bismuth is CRITICAL: DISK CRITICAL - free space: / 3381 MB (6% inode=90%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [01:27:04] (03PS1) 10BBlack: Bugfix for ECDHE curve logging [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319775 [01:27:06] (03PS1) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 [01:30:03] PROBLEM - check_disk on bismuth is CRITICAL: DISK CRITICAL - free space: / 2431 MB (4% inode=90%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [01:32:33] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:35:03] PROBLEM - check_disk on bismuth is CRITICAL: DISK CRITICAL - free space: / 1425 MB (2% inode=90%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 
100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [01:35:58] !log catrope@tin Synchronized php-1.29.0-wmf.1/extensions/Thanks: Avoid breakage after Flow uninstallation (duration: 00m 47s) [01:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:32] (03CR) 10BBlack: You might want to double-check that nginx doesn't let the 429 look cacheable, or it could be cached for other clients' queries of the same URL [puppet] - 10https://gerrit.wikimedia.org/r/319010 (https://phabricator.wikimedia.org/T108488) (owner: 10Smalyshev) [01:40:04] PROBLEM - check_disk on bismuth is CRITICAL: DISK CRITICAL - free space: / 489 MB (0% inode=90%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [01:45:13] PROBLEM - check_disk on bismuth is CRITICAL: DISK CRITICAL - free space: / 425 MB (0% inode=90%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [01:49:33] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [01:50:33] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 76740 bytes in 2.477 second response time [01:54:23] !log Manually reimaging labstore2003 (T149870) [01:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:29] T149870: Set up backups of tools and misc data from labstore1004/5 in labstore2003/4 - https://phabricator.wikimedia.org/T149870 [01:57:35] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:26:35] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [02:26:49] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 09m 08s) [02:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:26] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Nov 4 02:31:26 UTC 2016 (duration 4m 39s) [02:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:22] PROBLEM - Apache HTTP on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [02:48:22] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.119 second response time [02:52:52] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:15:30] PROBLEM - HHVM rendering on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [03:16:30] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 77278 bytes in 0.205 second response time [03:20:50] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [03:27:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 773.94 seconds [03:28:12] (03PS1) 10Madhuvishy: labstore: Apply role secondary::backup::tools-project to labstore2003 [puppet] - 10https://gerrit.wikimedia.org/r/319781 (https://phabricator.wikimedia.org/T149870) [03:29:54] (03CR) 10Madhuvishy: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319781 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [03:33:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 261.35 seconds [03:44:17] 
PROBLEM - Apache HTTP on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [03:44:27] PROBLEM - HHVM rendering on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [03:45:27] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.629 second response time [03:45:27] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 77280 bytes in 3.554 second response time [04:02:17] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code [04:02:27] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:13:47] PROBLEM - HHVM rendering on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [04:14:47] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 77278 bytes in 0.258 second response time [04:19:17] PROBLEM - Disk space on einsteinium is CRITICAL: DISK CRITICAL - free space: / 1773 MB (3% inode=97%) [04:19:37] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [04:32:29] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [04:35:10] fixed salt thing on labstore2003 [04:35:39] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:42:19] PROBLEM - HHVM rendering on mw1199 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [04:43:19] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 77278 bytes in 0.133 second response time [05:11:31] PROBLEM - Apache HTTP 
on mw1196 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [05:11:32] PROBLEM - HHVM rendering on mw1196 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [05:12:31] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.034 second response time [05:12:31] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 77277 bytes in 0.120 second response time [05:39:16] PROBLEM - HHVM rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [05:39:36] PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [05:40:16] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 77270 bytes in 0.125 second response time [05:40:36] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.023 second response time [05:52:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 649 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3050697 keys, up 3 days 21 hours - replication_delay is 649 [05:58:14] (03PS1) 10Yuvipanda: admin: Add .vimrc for yuvipanda [puppet] - 10https://gerrit.wikimedia.org/r/319787 [05:58:30] (03PS2) 10Yuvipanda: admin: Add .vimrc for yuvipanda [puppet] - 10https://gerrit.wikimedia.org/r/319787 [05:58:36] (03CR) 10Yuvipanda: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319787 (owner: 10Yuvipanda) [05:58:54] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3040901 keys, up 3 days 21 hours - replication_delay is 0 [06:08:14] PROBLEM - Apache HTTP on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [06:08:44] 
PROBLEM - HHVM rendering on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [06:09:14] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.027 second response time [06:09:44] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 77269 bytes in 0.221 second response time [06:34:18] RECOVERY - Disk space on einsteinium is OK: DISK OK [06:37:18] PROBLEM - HHVM rendering on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [06:37:18] PROBLEM - Apache HTTP on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [06:38:18] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 77279 bytes in 0.162 second response time [06:38:18] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.037 second response time [06:48:38] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3041 is CRITICAL: connect to address 10.20.0.176 and port 3128: Connection refused [06:49:06] <_joe_> looking ^^ [06:52:38] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3041 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.167 second response time [06:52:59] <_joe_> !log restarted manually varnish text-backend on cp3041 - failing automatic restarts with "no space left on device" [06:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:29] (03PS2) 10Jcrespo: mariadb: improve accuracy of replication lag check [puppet] - 10https://gerrit.wikimedia.org/r/318685 [07:04:33] 06Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T149964#2770845 (10jcrespo) [07:04:35] 06Operations, 10ops-eqiad, 10DBA: db1051 disk is about to fail - https://phabricator.wikimedia.org/T149908#2770847 (10jcrespo) [07:06:27] PROBLEM - Apache HTTP on mw1284 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [07:06:27] PROBLEM - HHVM rendering on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [07:07:26] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.023 second response time [07:07:26] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 77279 bytes in 0.241 second response time [07:08:52] (03PS3) 10Giuseppe Lavagetto: RESTBase: Use the LVS Realserver role [puppet] - 10https://gerrit.wikimedia.org/r/316954 (owner: 10Mobrovac) [07:11:36] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:13:10] (03PS1) 10Yuvipanda: labs: Add a script that dumps instance info onto a public url [puppet] - 10https://gerrit.wikimedia.org/r/319788 (https://phabricator.wikimedia.org/T143136) [07:16:26] (03PS2) 10Yuvipanda: labs: Add a script that dumps instance info onto a public url [puppet] - 10https://gerrit.wikimedia.org/r/319788 (https://phabricator.wikimedia.org/T143136) [07:16:55] (03PS3) 10Yuvipanda: labs: Add a script that dumps instance info onto a public url [puppet] - 10https://gerrit.wikimedia.org/r/319788 (https://phabricator.wikimedia.org/T143136) [07:20:00] !log disabling alerting for slave lag fleet-wide for 1 hour to deploy new alerting script [07:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:52] (03PS4) 10Yuvipanda: labs: Add a script that dumps instance info onto a public url [puppet] - 10https://gerrit.wikimedia.org/r/319788 (https://phabricator.wikimedia.org/T143136) [07:20:59] (03CR) 10Jcrespo: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/318685 (owner: 10Jcrespo) [07:23:00] (03PS5) 10Yuvipanda: labs: Add a script that dumps instance info onto a public url [puppet] - 
10https://gerrit.wikimedia.org/r/319788 (https://phabricator.wikimedia.org/T143136) [07:23:07] (03CR) 10Yuvipanda: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319788 (https://phabricator.wikimedia.org/T143136) (owner: 10Yuvipanda) [07:23:55] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2770879 (10Marostegui) dbstore2002 caught up. \o/ I am going to do a few tests to make sure it is fine and if so, on Sunday I will take a final snapshot of dbstore2001, move it to dbstore2002 and... [07:28:23] (03CR) 10Giuseppe Lavagetto: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/316954 (owner: 10Mobrovac) [07:28:29] (03PS4) 10Giuseppe Lavagetto: RESTBase: Use the LVS Realserver role [puppet] - 10https://gerrit.wikimedia.org/r/316954 (owner: 10Mobrovac) [07:28:36] \o/ [07:28:50] (03CR) 10Giuseppe Lavagetto: [V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/316954 (owner: 10Mobrovac) [07:32:52] <_joe_> mobrovac: uhm something is wrong, I still don't know what [07:33:00] uh? 
[07:33:05] <_joe_> but it's ok, I disabled puppet everywhere [07:33:20] hm [07:34:22] <_joe_> mobrovac: the role people use is "role::restbase::server" I think, wtf :P [07:34:57] oh right [07:34:58] damn [07:35:00] <_joe_> that's one of the fake names invented during the transition [07:35:04] <_joe_> whatever :) [07:35:14] did notice this change [07:35:20] *didn't [07:35:28] <_joe_> well it must be very old :) [07:35:44] ok, will create a patch [07:35:57] <_joe_> I'm doing it [07:36:26] (03PS1) 10Giuseppe Lavagetto: restbase: fix hieradata location [puppet] - 10https://gerrit.wikimedia.org/r/319789 [07:36:27] <_joe_> and then I'm renaming that fucking module :P [07:37:04] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319789 (owner: 10Giuseppe Lavagetto) [07:37:05] ah ok [07:37:05] haha [07:39:10] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:42:22] (03PS1) 10Yuvipanda: labs: Typo fixed to instance dumper code [puppet] - 10https://gerrit.wikimedia.org/r/319790 [07:53:19] (03CR) 10MarcoAurelio: [C: 031] I had this on my mind already, but always forgot to even ask why this was there. I agree this 'bots' usergroup makes no sense and if 'bot' needs 'skipcaptcha' we should simply add that right to the bot group instead. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/319683 (owner: 10BryanDavis) [07:54:20] (03PS1) 10Yuvipanda: docker: Allow setting an apt proxy inside the docker image [puppet] - 10https://gerrit.wikimedia.org/r/319792 [07:56:46] (03PS2) 10Yuvipanda: labs: Typo fixed to instance dumper code [puppet] - 10https://gerrit.wikimedia.org/r/319790 [07:56:48] (03PS1) 10Yuvipanda: docker: Add set -e to build-base-images script [puppet] - 10https://gerrit.wikimedia.org/r/319793 [07:57:00] (03CR) 10Yuvipanda: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319790 (owner: 10Yuvipanda) [07:57:13] (03CR) 10Yuvipanda: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319793 (owner: 10Yuvipanda) [08:03:40] PROBLEM - HHVM rendering on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [08:04:15] Nov 04 08:02:01 mw1190 CRON[10036]: (root) CMD (/usr/local/bin/hhvm-needs-restart > /dev/null && /usr/local/sbin/run-no-puppet /usr/local/bin/restart-hhvm > /dev/null) [08:04:40] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 77266 bytes in 0.088 second response time [08:06:00] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:10:27] (03PS1) 10Muehlenhoff: Retroactively add CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/319795 [08:11:12] (03CR) 10Muehlenhoff: [C: 032] [debs/linux44] - 10https://gerrit.wikimedia.org/r/319795 (owner: 10Muehlenhoff) [08:11:52] (03PS1) 10Yuvipanda: labs: Kill instance_info_dumper cron [puppet] - 10https://gerrit.wikimedia.org/r/319796 [08:12:08] (03CR) 10Yuvipanda: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319796 (owner: 10Yuvipanda) [08:17:51] (03PS1) 10Muehlenhoff: Add another reference to a recently assigned CVE ID which is already fixed [debs/linux44] - 10https://gerrit.wikimedia.org/r/319797 [08:18:54] (03CR) 10Muehlenhoff: [C: 032] [debs/linux44] - 10https://gerrit.wikimedia.org/r/319797 (owner: 10Muehlenhoff) [08:20:25] 06Operations, 06Performance-Team, 10Thumbor: Investigate why oom_kill mtail program doesn't work properly - https://phabricator.wikimedia.org/T149980#2770912 (10Gilles) [08:23:10] 06Operations, 06Performance-Team, 10Thumbor: Ask firejail upstream about ability to turn off pid namespacing - https://phabricator.wikimedia.org/T149981#2770927 (10Gilles) [08:32:08] 06Operations, 06Labs, 07Tracking: Add config option in tools webservice debian package to write logs to /dev/null - https://phabricator.wikimedia.org/T149946#2770945 (10yuvipanda) Let's do this rather than make it configurable, since doing config via puppet in kubernetes land is kinda not the easiest. So our... 
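The T149946 comment above proposes routing tools-webservice logs straight to /dev/null rather than making the destination configurable. A minimal sketch of that idea, assuming a hypothetical `spawn_quiet` helper (this is not the actual tools-webservice change under review):

```python
import subprocess

def spawn_quiet(cmd):
    """Start cmd with stdout and stderr discarded.

    subprocess.DEVNULL routes the child's output to /dev/null, which is
    the behaviour the T149946 discussion settles on: no log files at all,
    rather than a configurable log path.
    """
    return subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
```

The child process still runs normally; only its output is dropped, so nothing accumulates on disk.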
[08:32:19] (03PS1) 10Yuvipanda: Route all logs to /dev/null [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/319798 (https://phabricator.wikimedia.org/T149946) [08:32:22] PROBLEM - HHVM rendering on mw1233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [08:33:02] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:33:16] (03PS1) 10Marostegui: mariadb: Added gtid_domain_id option [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) [08:33:22] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 77278 bytes in 0.121 second response time [08:42:22] 06Operations, 06Performance-Team, 10Thumbor: Ask firejail upstream about ability to turn off pid namespacing - https://phabricator.wikimedia.org/T149981#2770952 (10Gilles) https://github.com/netblue30/firejail/issues/892 [08:45:02] (03CR) 10Jcrespo: [C: 04-1] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [08:46:12] (03PS1) 10Gilles: Prevent Thumbor from creating files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/319802 (https://phabricator.wikimedia.org/T145878) [08:49:52] I was going to reimage m2-codfw, lets test there, marostegui [08:50:16] jynus: Sure, I found a small typo anyways, so I am going to submit the patch and then reply to your comments :) [08:51:52] (03PS2) 10Marostegui: mariadb: Added gtid_domain_id option [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) [08:51:59] jynus: I do not want to deploy this anyways, I was just trying to advance a bit on this ticket in the meantime [08:52:09] oh, I think we can deploy [08:52:13] But you had a very good point with the collision [08:52:35] I think you should be bolder, on a smaller scope [08:54:16] (03CR) 10Marostegui: (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [08:55:37] (03CR) 10Marostegui: (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [08:56:27] (03CR) 10Jcrespo: I know the dbstores used to aggregate different shards, and dbstores are the #1 reason why we deploy this- just check it. I am also thinking on unnormal states, such as a rename. They are strange, but it could happen. I only ask to think if error is possible. [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:00:44] (03CR) 10Jcrespo: Another example: labsdb1003 and db1003 and dbstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:01:36] PROBLEM - Apache HTTP on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [09:02:36] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.028 second response time [09:02:40] (03PS2) 10Giuseppe Lavagetto: docker: Allow setting an apt proxy inside the docker image [puppet] - 10https://gerrit.wikimedia.org/r/319792 (owner: 10Yuvipanda) [09:03:50] (03CR) 10jenkins-bot: [V: 04-1] docker: Allow setting an apt proxy inside the docker image [puppet] - 10https://gerrit.wikimedia.org/r/319792 (owner: 10Yuvipanda) [09:04:26] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2770979 (10Paladox) The bot should now automatically try and reconnect to ssh. We will see if this works with prod gerrit when it... [09:07:34] (03CR) 10Marostegui: Maybe we can use the @server_id. 
[puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:10:22] !log Reimage db2034 - T149553 [09:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:29] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [09:10:45] (03PS3) 10Giuseppe Lavagetto: docker: Allow setting an apt proxy inside the docker image [puppet] - 10https://gerrit.wikimedia.org/r/319792 (owner: 10Yuvipanda) [09:10:47] (03PS1) 10Muehlenhoff: Bump the kernel ABI to 3 (caused by posix ACL changes in 4.4.29) [debs/linux44] - 10https://gerrit.wikimedia.org/r/319803 [09:11:29] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2771012 (10Gilles) So, IM isn't being smart or anything, setting a limit makes it bail and error instead of trying to work within that limit: ```... [09:12:16] (03CR) 10Jcrespo: [C: 04-1] Just remove the file. [puppet] - 10https://gerrit.wikimedia.org/r/318450 (owner: 10Dzahn) [09:13:10] 06Operations, 06Performance-Team, 10Thumbor: Limit IM engine execution time - https://phabricator.wikimedia.org/T149985#2771027 (10Gilles) [09:13:36] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2771044 (10Mvolz) Would maybe a config option for this be of use? Then maybe we should run benchmarks or something? Another thing that might help is that if we try... [09:13:43] 06Operations, 10Ops-Access-Requests: Access to fluorine for viewing logs (wm-log-reader) - https://phabricator.wikimedia.org/T149832#2764536 (10MoritzMuehlenhoff) Hi, there's some missing information in this access request, could you please go through https://wikitech.wikimedia.org/wiki/Production_shell_access... 
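The review thread on Gerrit change 319799 above worries that a fixed `gtid_domain_id` could collide on replicas that aggregate several shards (the dbstore hosts, or name clashes like labsdb1003 / db1003 / dbstore1003), and Marostegui suggests reusing the already-unique `@server_id`. A hedged sketch of that idea in MariaDB SQL (an illustration of the suggestion, not the patch that was eventually merged):

```sql
-- Derive gtid_domain_id from the already-unique server_id so that no
-- two masters in the fleet can ever share a domain id.
SET GLOBAL gtid_domain_id = @@server_id;

-- Verify the assignment on each host:
SELECT @@server_id, @@gtid_domain_id;
```

Because `server_id` must already be unique fleet-wide for replication to work, piggybacking on it sidesteps the need to maintain a second unique-id scheme.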
[09:13:46] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2771048 (10Mvolz) a:03Mvolz [09:13:58] 06Operations, 10ops-eqiad, 06DC-Ops: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#2771049 (10hashar) [09:15:56] (03PS1) 10Jcrespo: mariadb: Enable unix_socket authentication on misc servers (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/319804 (https://phabricator.wikimedia.org/T146149) [09:16:49] 06Operations, 06Performance-Team, 10Thumbor: Limit IM engine execution time - https://phabricator.wikimedia.org/T149985#2771073 (10Gilles) Actually, it doesn't seem like the time limit is exposed through MagickSetResourceLimit. Environment variables are probably the way to go. [09:16:54] <_joe_> jenkins halted again? [09:17:15] <_joe_> just dog slow [09:17:19] (03CR) 10Giuseppe Lavagetto: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319792 (owner: 10Yuvipanda) [09:22:46] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:22:58] 06Operations, 06Performance-Team, 10Thumbor, 15User-Joe: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#2771081 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:23:12] 06Operations, 13Patch-For-Review: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543#2771082 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:24:43] (03PS1) 10Urbanecm: Allow local sysops to add accountcreator group in fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319805 (https://phabricator.wikimedia.org/T149986) [09:24:49] 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#2771101 (10MoritzMuehlenhoff) a:03Cmjohnson [09:26:25] <_joe_> !log rebooting copper to allow enabling the memory cgroup [09:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:06] (03PS1) 10Jcrespo: Allow SSL (TLS) and performance_schema on misc servers [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) [09:28:29] 06Operations, 06Performance-Team, 10Thumbor: Limit IM engine execution time - https://phabricator.wikimedia.org/T149985#2771113 (10Gilles) Neat, the environment variables work as expected and result in a graceful error at the Thumbor level: ``` thumbor: ERROR: ERROR: Traceback (most recent call last): Fi... [09:29:05] 06Operations, 06Performance-Team, 10Thumbor: Set IM Thumbor engine environment variables - https://phabricator.wikimedia.org/T149985#2771115 (10Gilles) [09:30:03] marostegui, I plan to reimage misc db2011 [09:30:16] maybe we can try a first domain id change there? 
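The environment variables Gilles mentions above are ImageMagick's documented resource-limit knobs (MAGICK_TIME_LIMIT, MAGICK_MEMORY_LIMIT, MAGICK_DISK_LIMIT, …). A minimal sketch of wrapping a conversion with them — the limit values and filenames here are illustrative, not the ones Thumbor ended up using:

```shell
# Illustrative limits only; tune per workload.
export MAGICK_TIME_LIMIT=60        # seconds before IM aborts the operation
export MAGICK_MEMORY_LIMIT=256MiB  # heap allowed before spilling to the pixel cache
export MAGICK_DISK_LIMIT=1GiB      # caps pixel-cache files on disk (the /tmp filling issue)

# When a limit is exceeded, `convert` exits non-zero with a resource error,
# which a caller like Thumbor can surface as a graceful 500. Guarded so the
# sketch is harmless on hosts without ImageMagick:
if command -v convert >/dev/null 2>&1; then
  convert input.jpg -thumbnail 400x output.jpg || echo "convert failed or hit a resource limit"
fi
```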
[09:30:23] PROBLEM - HHVM rendering on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [09:30:37] (03PS2) 10Jcrespo: mariadb: Enable unix_socket authentication on misc servers (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/319804 (https://phabricator.wikimedia.org/T146149) [09:30:43] PROBLEM - Apache HTTP on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.004 second response time [09:30:48] jynus: sure [09:31:03] jynus: db2011 is the one that had RAID issues a few days ago (it is all fine now, just fyi) [09:31:23] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 77232 bytes in 0.101 second response time [09:31:43] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.021 second response time [09:32:13] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[docker] [09:32:53] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:36:05] 06Operations, 06Performance-Team, 10Thumbor, 15User-Joe: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#2756079 (10Gilles) The logs can be misleading, while the rsvg-convert command failed, Thumbor correctly responded with a 500 and kept runni... 
[09:36:39] (03PS3) 10MarcoAurelio: Rename 'autopatrol' to 'autopatrolled' on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308446 (https://phabricator.wikimedia.org/T144699) [09:37:45] (03CR) 10Muehlenhoff: [C: 032] [debs/linux44] - 10https://gerrit.wikimedia.org/r/319803 (owner: 10Muehlenhoff) [09:40:13] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:42:20] (03PS1) 10Gilles: Set environment variables for ImageMagick running inside Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/319807 (https://phabricator.wikimedia.org/T149985) [09:46:52] (03PS1) 10Urbanecm: Allow reviewers to stabilize pages in Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319808 (https://phabricator.wikimedia.org/T149987) [09:49:12] (03PS4) 10Urbanecm: Add possibility to disable CompactLink in default state and disable it on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298187 (https://phabricator.wikimedia.org/T139903) [09:49:18] (03PS5) 10Urbanecm: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) [09:49:44] (03PS2) 10Urbanecm: Enable wgAbuseFilterProfile at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319566 (https://phabricator.wikimedia.org/T149899) [09:49:47] (03PS2) 10Urbanecm: New user right and user group for et.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319568 (https://phabricator.wikimedia.org/T149610) [09:49:49] (03PS2) 10Urbanecm: Allow local sysops to add accountcreator group in fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319805 (https://phabricator.wikimedia.org/T149986) [09:49:51] (03PS2) 10Urbanecm: Allow reviewers to stabilize pages in Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319808 (https://phabricator.wikimedia.org/T149987) [09:51:08] (03PS1) 
10Marostegui: mariadb: Added gtid_domain_id option for m1 [puppet] - 10https://gerrit.wikimedia.org/r/319809 (https://phabricator.wikimedia.org/T149418) [09:51:43] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [09:53:00] (03PS2) 10Marostegui: mariadb: Added gtid_domain_id option for m2 [puppet] - 10https://gerrit.wikimedia.org/r/319809 (https://phabricator.wikimedia.org/T149418) [09:54:36] !log upgrading memcached on jessie graphite systems [09:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:42] (03CR) 10Jcrespo: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319809 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:59:07] !log upgrading memcached on swift frontend servers in codfw [09:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:17] PROBLEM - Apache HTTP on mw1225 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [09:59:27] PROBLEM - HHVM rendering on mw1225 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [09:59:36] (03CR) 10Marostegui: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319809 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [09:59:57] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:00:17] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.024 second response time [10:00:27] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 77233 bytes in 0.192 second response time [10:00:44] !log stopping db2011 for backup and reimage [10:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:13] ^proxy will complain while it is down, but I need to see only 1 server fails, and not 
both [10:02:12] jynus: I did a backup of db2011 on Wednesday before the RAID thing: dbstore2001:/srv/tmp/db2011.tar.gz.enc [10:02:21] jynus: but if you want to do one, that is also fine, just saying :) [10:02:40] well, I did a backup of m2 too (I do one every week) [10:02:59] and by 'me', I mean the backup system :-P [10:03:01] (03CR) 10Marostegui: check [puppet] - 10https://gerrit.wikimedia.org/r/319809 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [10:03:12] XDD [10:03:22] but I want to keep things as they are [10:03:32] and have less lag time later [10:03:42] too many backups is never a bad thing [10:04:00] sí [10:05:26] not sure 'sí' means the same thing as 'yes' in this context [10:05:37] haha [10:06:09] but it may be Castilian Spanish only [10:07:17] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [10:07:37] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [10:08:19] (03Abandoned) 10Marostegui: mariadb: Added gtid_domain_id option [puppet] - 10https://gerrit.wikimedia.org/r/319799 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [10:09:41] ^all totally normal and expected [10:11:35] (03PS3) 10Marostegui: mariadb: Added gtid_domain_id option for m2 [puppet] - 10https://gerrit.wikimedia.org/r/319809 (https://phabricator.wikimedia.org/T149418) [10:15:12] (03PS2) 10Jcrespo: Allow SSL (TLS) and performance_schema on misc servers [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) [10:15:36] (03CR) 10jenkins-bot: [V: 04-1] Allow SSL (TLS) and performance_schema on misc servers [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:16:37] (03PS3) 10Jcrespo: mariadb: Enable unix_socket authentication on misc servers (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/319804 (https://phabricator.wikimedia.org/T146149)
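The gtid_domain_id patches being reviewed above come down to a one-line my.cnf setting per server. A hedged sketch of such a fragment — the numeric value is a placeholder, not what was deployed on m1/m2:

```shell
# Print an example my.cnf fragment; 2211 is a made-up domain id.
cnf='[mysqld]
# Must be unique per independently-writable master, so GTID streams
# originating in different datacenters/sections never collide.
gtid_domain_id = 2211'
printf '%s\n' "$cnf"
```

On MariaDB 10.0+ the same change can be applied at runtime with `SET GLOBAL gtid_domain_id = 2211;`, which avoids a restart but still needs the my.cnf entry to survive one.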
[10:16:59] (03CR) 10Jcrespo: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319804 (https://phabricator.wikimedia.org/T146149) (owner: 10Jcrespo) [10:19:17] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:21:23] !log upgrading memcached on swift frontend servers in esams [10:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:35] (03PS1) 10Giuseppe Lavagetto: profile::docker::registry: allow incoming connections [puppet] - 10https://gerrit.wikimedia.org/r/319814 [10:22:09] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319814 (owner: 10Giuseppe Lavagetto) [10:23:18] <_joe_> jynus: ok to merge your change? [10:24:47] PROBLEM - Memcached on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 11211: Connection refused [10:25:08] Hey… was October 5th the date of any major breakage? [10:25:17] PROBLEM - Memcached on ms-fe3002 is CRITICAL: connect to address 10.20.0.16 and port 11211: Connection refused [10:25:40] Lots of broken transcodes on commons from that date (videos) that work fine when thrown back in the queue. [10:26:01] <_joe_> Revent: I honestly don't remember, let me check [10:26:27] _joe_, yes [10:26:47] RECOVERY - Memcached on ms-fe3001 is OK: TCP OK - 0.084 second response time on 10.20.0.15 port 11211 [10:26:47] <_joe_> jynus: {{done}} [10:26:49] I was trying to merge the other one manually [10:27:01] git doesn't like 3-way merge [10:27:10] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2771203 (10mobrovac) I think the base problem here is that Citoid strongly depends on the DB. As you point out, we use it throughout Citoid. What would be the direc...
[10:27:17] RECOVERY - Memcached on ms-fe3002 is OK: TCP OK - 0.084 second response time on 10.20.0.16 port 11211 [10:27:29] <_joe_> Revent: nothing relevant enough to make it to the incident documentation, I guess [10:27:31] PROBLEM - Apache HTTP on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [10:27:31] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [10:27:41] Might just have been a glitch. [10:28:21] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.030 second response time [10:28:21] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 77279 bytes in 0.349 second response time [10:28:37] A fair number of them are in the Special:TimedMediaHandler error list, but show as ‘okay’ on the file page, yet are clearly ‘not’ okay. [10:29:05] (but like I said, work fine when thrown back in)…. \o/ [10:29:37] <_joe_> Revent: to be honest, we don't have great monitoring on the videoscaling infrastructure [10:29:50] <_joe_> one of the thousands of things on my todo lists :( [10:30:59] Yeah, I noted a third of a million entries in the ‘broken’ list, decided it was a good time of day (probably low load) to plug at it a bit. [10:31:24] (not like I can’t do other stuff while waiting on them to cook) [10:32:22] Actually… got a moment to look? [10:32:44] (03PS3) 10Jcrespo: Allow SSL (TLS) and performance_schema on misc servers [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) [10:32:59] https://commons.wikimedia.org/wiki/File:President_Obama_and_Prime_Minister_Trudeau_Deliver_Remarks_at_State_Dinner.webm <- someone else’s ‘new’ upload, and one of the transcodes just errored out.
[10:33:13] <_joe_> Revent: I can look at the logs, yes [10:33:17] it took me more time than it should have because git additions broke gerrit review expectations [10:33:40] Since it was ‘just now’ (and looks odd, since others are working) might give an indication. [10:33:49] (03CR) 10jenkins-bot: [V: 04-1] Allow SSL (TLS) and performance_schema on misc servers [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:34:30] <_joe_> Revent: not necessarily [10:34:36] <_joe_> but yeah I can take a look for sure [10:35:15] (nods) Just figured since it was ‘fresh’, it wouldn’t be evidence of problems from a month ago, lol. [10:35:20] <_joe_> Revent: you quite overloaded the videoscalers I'd say :) [10:35:54] <_joe_> https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name :P [10:35:59] (03CR) 10Jcrespo: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:36:08] Oh, I figured it would hold them in the ‘cache’ until it had space. [10:36:16] <_joe_> since it's all async processing, it's not an issue at all [10:36:18] Oh, wow. [10:36:28] <_joe_> they just max out the cpu [10:36:35] <_joe_> and the risk is they hit a timeout [10:36:59] (nods) Gotcha, that might be the issue… I’ll slow down, and keep an eye on that.
[10:37:20] (I mean, timeout might be the ‘whole’ issue) [10:43:44] (03PS4) 10Jcrespo: Allow SSL (TLS) and performance_schema on misc servers [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) [10:44:21] _joe_: TYVM, btw, for pointing me at the right load graph… (bookmarks) [10:44:44] <_joe_> Revent: you're welcome :) [10:44:59] <_joe_> Revent: also, that url might change in the near future, so don't thank me :P [10:45:10] (03CR) 10jenkins-bot: [V: 04-1] Allow SSL (TLS) and performance_schema on misc servers [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:45:15] I’ll just come in here and yell. :P [10:45:24] (03CR) 10Jcrespo: [V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:46:17] jenkins doesn't let me merge [10:46:21] <_joe_> Revent: I found https://gerrit.wikimedia.org/r/#/c/314433/ which seems to indicate there was some issue with a released feature on oct 5 [10:46:28] <_joe_> jynus: ? [10:46:29] even if I override [10:46:32] <_joe_> what's up? 
[10:46:38] the clearly bogus test [10:46:48] Not Verified [10:46:53] <_joe_> 10:45:08 ./modules/profile/manifests/docker/registry.pp:63 WARNING indentation of => is not properly aligned (arrow_alignment) [10:46:55] Verified [10:46:55] -1 jenkins-bot [10:46:57] <_joe_> that's my failure [10:46:58] +2 Jcrespo [10:46:59] <_joe_> fixing [10:47:01] <_joe_> sorry [10:47:03] I do not care about that [10:47:06] I care about gerrit [10:47:14] TimedText seems ‘related’, even if not ‘involved’ [10:47:15] not lettin me merge [10:47:16] <_joe_> let me see [10:47:26] (03CR) 10Giuseppe Lavagetto: [V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319806 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:47:56] https://gerrit.wikimedia.org/r/319806 [10:47:58] <_joe_> jynus: removed jenkins-bot as a reviewer [10:48:01] <_joe_> now you can merge [10:48:08] <_joe_> that's the old trick I always used [10:48:18] ok, thanks [10:48:21] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:48:23] the fix can wait [10:48:26] no issue [10:48:26] (03PS1) 10Muehlenhoff: Provide a systemd override unit for memcached [puppet] - 10https://gerrit.wikimedia.org/r/319820 [10:48:28] <_joe_> I'll fix my fuckup now [10:49:38] (03PS1) 10Giuseppe Lavagetto: docker::registry: fix indentation [puppet] - 10https://gerrit.wikimedia.org/r/319821 [10:49:40] (03CR) 10jenkins-bot: [V: 04-1] Provide a systemd override unit for memcached [puppet] - 10https://gerrit.wikimedia.org/r/319820 (owner: 10Muehlenhoff) [10:49:49] but I thought +2 invalidated a -1 [10:49:56] on verified [10:50:00] <_joe_> jynus: not really, you can vote in its place [10:50:11] ok, then I got it wrong [10:50:24] <_joe_> the trick is to remove jenkins-bot from reviewers in those cases [10:50:28] my fault [10:50:29] then [10:51:50] (03CR) 10Giuseppe Lavagetto: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319821 (owner: 10Giuseppe Lavagetto) [10:55:21] 
PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:21] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:31] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:51] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:41] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [10:57:06] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [10:57:07] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [10:57:16] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [10:58:54] !log installing tar security updates [10:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:14] (03PS1) 10Ema: cache_text: route ulsfo to codfw [puppet] - 10https://gerrit.wikimedia.org/r/319823 (https://phabricator.wikimedia.org/T131503) [11:16:06] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:16] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:26] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:26] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:38] (03PS1) 10Ema: cache_text: upgrade ulsfo to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319824 (https://phabricator.wikimedia.org/T131503) [11:19:16] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:16] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
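For reference, the "old trick" _joe_ describes above — removing jenkins-bot as a reviewer so its stale V-1 no longer blocks the submit — can also be done over Gerrit's SSH interface. A sketch only: the command is echoed rather than executed, change 319806 is the one from the log, and `set-reviewers` availability depends on the Gerrit version in use:

```shell
# Hypothetical invocation against Wikimedia's Gerrit; echoed for safety.
gerrit_host=gerrit.wikimedia.org
change=319806
cmd="ssh -p 29418 $gerrit_host gerrit set-reviewers --remove jenkins-bot $change"
echo "$cmd"
```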
[11:19:26] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:46] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:44] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2771253 (10Marostegui) dbstore2002 looks good, stopping and starting slaves, the mysqld process and so forth shows no errors. I have seen that the tokudb plugin cannot be loaded ``` 161104 11:15... [11:22:06] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [11:22:06] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:22:16] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:22:36] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [11:23:56] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [11:24:06] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:24:16] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:24:16] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [11:24:55] 06Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T149964#2771254 (10MoritzMuehlenhoff) a:03Cmjohnson [11:25:13] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2771256 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:25:16] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:16] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:25:26] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:27] 06Operations, 10Ops-Access-Requests: Access to fluorine for viewing logs (wm-log-reader) - https://phabricator.wikimedia.org/T149832#2771257 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:25:38] 06Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#2771259 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:25:46] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:16] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[quickstack] [11:27:06] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:27:06] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [11:27:16] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:27:42] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [11:28:54] <_joe_> what's up with citoid? [11:29:02] <_joe_> anyone taken a look? [11:29:11] <_joe_> it seems just severe perf degradation [11:30:06] 06Operations, 06Discovery, 06Maps: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2771261 (10Gehel) Other strange thing: If I sum the request rates for all zoom levels (as [[ https://graphite.wikimedia.org/S/Bu | reported by kartotherian ]])... [11:31:12] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:22] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:32] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:31:32] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:32:47] i'm betting this is zotero _joe_ ^ [11:32:59] * mobrovac restarting zotero [11:33:51] !log restarting zotero [11:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:03] <_joe_> mobrovac: looking at the dashboard it seems just more requests than usual [11:34:51] _joe_: actually https://grafana-admin.wikimedia.org/dashboard/db/service-citoid [11:36:52] PROBLEM - DPKG on xenon is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:37:04] ? [11:37:58] mobrovac: that's fine, currently upgrading something [11:38:10] that icinga check is a little trigger-happy [11:39:02] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [11:39:12] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:39:22] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [11:39:22] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:39:24] kk thnx moritzm [11:40:14] (03CR) 10Volans: Filippo, thanks a lot for the quick fixes. (035 comments) [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [11:40:16] _joe_: i think that was zotero, looking at the dashboard the number of requests is more or less the same, but now zotero isn't eating 20% of mem [11:40:26] <_joe_> sigh [11:40:42] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[initramfs-tools] [11:40:52] RECOVERY - DPKG on xenon is OK: All packages OK [11:41:42] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:48:31] 06Operations, 05Prometheus-metrics-monitoring: prometheus-node-exporter package should use a systemd override - https://phabricator.wikimedia.org/T149992#2771267 (10MoritzMuehlenhoff) [11:50:22] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:22] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:29] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#2771280 (10akosiaris) [11:50:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:52] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:53:11] (03PS1) 10Ema: cache_text varnishtest: proper caching of mangled URLs [puppet] - 10https://gerrit.wikimedia.org/r/319829 (https://phabricator.wikimedia.org/T131503) [11:54:12] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:57:13] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:57:13] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [11:57:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [11:57:43] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [12:01:25] (03PS1) 10Jcrespo: Enable ssl (TLS) on misc database servers [puppet] - 10https://gerrit.wikimedia.org/r/319831 (https://phabricator.wikimedia.org/T111654) [12:03:49] (03CR) 10Jcrespo: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319831 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [12:11:23] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:33] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:33] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:34] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
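Enabling TLS on the misc database servers (the change jynus merges above) boils down to pointing mysqld at a CA and a server key pair in my.cnf. A hedged sketch — the certificate paths are illustrative, not the puppet-managed ones:

```shell
# Example my.cnf fragment for MariaDB server-side TLS; paths are placeholders.
tls_cnf='[mysqld]
ssl-ca   = /etc/mysql/ssl/ca.pem
ssl-cert = /etc/mysql/ssl/server-cert.pem
ssl-key  = /etc/mysql/ssl/server-key.pem'
printf '%s\n' "$tls_cnf"
```

After a restart, `SHOW GLOBAL VARIABLES LIKE 'have_ssl';` should report `YES` if the certificates loaded cleanly.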
[12:14:13] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:14:23] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:14:23] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [12:14:23] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [12:22:53] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4008 is CRITICAL: connect to address 10.128.0.108 and port 3128: Connection refused [12:23:23] PROBLEM - HHVM rendering on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [12:23:33] PROBLEM - Apache HTTP on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [12:24:23] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 77266 bytes in 0.086 second response time [12:24:33] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.023 second response time [12:26:20] 06Operations: Integrate jessie 8.3 point update - https://phabricator.wikimedia.org/T124647#2771320 (10MoritzMuehlenhoff) 05Open>03Resolved This is rolled out completely across our jessie systems (this was 99% percent done and I fixed a up a few missing hosts today) [12:31:26] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:31:36] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:31:36] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:31:36] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:34:16] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:34:26] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:34:26] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [12:34:26] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [12:35:46] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4008 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.157 second response time [12:37:36] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 601 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3056592 keys, up 4 days 4 hours - replication_delay is 601 [12:55:06] RECOVERY - check_disk on bismuth is OK: DISK OK - free space: / 34373 MB (64% inode=88%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 7988 MB (99% inode=99%): /run 1599 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7999 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 182 MB (73% inode=99%): /a 384415 MB (99% inode=99%) [12:59:43] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#2771355 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:04:52] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:04:52] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:02] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:05:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:07:42] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:07:42] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [13:07:52] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [13:08:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:08:58] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#2771360 (10elukey) piwik_2.16.0-1_all.deb seems to be in @ori's home :P http://debian.piwik.org/ contains the latest debs, maybe we could upload it to third-party? [13:11:10] _joe_: Well, I think if there was any dust accumulated on the video scaler’s heatsinks it’s gone by now… :/ [13:11:19] <_joe_> Revent: eheh [13:11:52] I, really, did not think it would actually try to do them all ‘at once’, lol. [13:12:18] <_joe_> Revent: it should not, I have to check what is misconfigured there anyways [13:12:30] <_joe_> probably we allow too many transcodes to go on at the same time [13:13:20] Part of it, though… I did not quite look at exactly ‘what’ they were, other than broken transcodes. [13:13:45] They are 1080p copies of Obama’s State of the Union addresses, lol. [13:15:16] So I kinda threw like… 10, 15 gig of stuff at it all at once… :/ [13:16:59] But… servers are obviously stable, lol. [13:18:52] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:18:52] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:19:02] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:19:22] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
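Requeueing broken transcodes the way Revent describes is normally done with TimedMediaHandler's requeueTranscodes.php maintenance script. A sketch only — the command is echoed rather than run, and the `--error` filter flag is an assumption from memory, so check the script's `--help` before relying on it:

```shell
# Hypothetical invocation; flags beyond the script name are assumptions.
cmd='mwscript extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=commonswiki --error'
echo "$cmd"
```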
[13:21:42] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:21:42] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [13:21:52] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [13:22:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:22:55] (03PS2) 10Ema: cache_text varnishtest: proper caching of mangled requests [puppet] - 10https://gerrit.wikimedia.org/r/319829 (https://phabricator.wikimedia.org/T131503) [13:24:38] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2771447 (10elukey) The issue re-appeared again in upload today (early UTC morning time), all concentrated in `ulsfo`. I managed to cap... [13:27:28] 06Operations: Not all packages from packages::statistics are available on jessie - https://phabricator.wikimedia.org/T150003#2771469 (10MoritzMuehlenhoff) [13:29:12] (03PS1) 10Muehlenhoff: statistics::packages: Use emacs instead of emacs23 [puppet] - 10https://gerrit.wikimedia.org/r/319839 (https://phabricator.wikimedia.org/T150003) [13:36:07] (03CR) 10Muehlenhoff: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319616 (owner: 10Ema) [13:36:36] (03CR) 10BBlack: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319823 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:36:49] (03CR) 10BBlack: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319824 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:38:18] (03PS3) 10Andrew Bogott: Designate nova_fixed_multi plugin: avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/319759 (https://phabricator.wikimedia.org/T115194) [13:38:27] (03CR) 10BBlack: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319829 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:39:04] (03CR) 10Ema: [C: 032] [puppet] - 
10https://gerrit.wikimedia.org/r/319829 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:41:30] (03PS2) 10Ema: cache_text: route ulsfo to codfw [puppet] - 10https://gerrit.wikimedia.org/r/319823 (https://phabricator.wikimedia.org/T131503) [13:41:38] (03CR) 10Ema: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319823 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:41:40] (03PS4) 10Andrew Bogott: Designate nova_fixed_multi plugin: avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/319759 (https://phabricator.wikimedia.org/T115194) [13:43:35] (03CR) 10Andrew Bogott: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319759 (https://phabricator.wikimedia.org/T115194) (owner: 10Andrew Bogott) [13:45:02] (03CR) 10Elukey: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319839 (https://phabricator.wikimedia.org/T150003) (owner: 10Muehlenhoff) [13:45:08] (03PS5) 10Andrew Bogott: Designate nova_fixed_multi plugin: avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/319759 (https://phabricator.wikimedia.org/T115194) [13:46:41] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2771535 (10elukey) The following request takes ages to complete on `cp400[67]` but it completes very quickly on `cp1099`: ``` curl "h... 
[13:47:26] PROBLEM - HHVM rendering on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [13:48:26] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 77283 bytes in 0.075 second response time [13:53:21] (03PS2) 10Muehlenhoff: statistics::packages: Use emacs instead of emacs23 [puppet] - 10https://gerrit.wikimedia.org/r/319839 (https://phabricator.wikimedia.org/T150003) [13:56:04] (03CR) 10Muehlenhoff: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319839 (https://phabricator.wikimedia.org/T150003) (owner: 10Muehlenhoff) [13:57:11] (03PS1) 10Jcrespo: mariadb: Install jessie on db2011 (m2) [puppet] - 10https://gerrit.wikimedia.org/r/319847 [13:57:15] (03PS2) 10Ema: cache_text: upgrade ulsfo to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319824 (https://phabricator.wikimedia.org/T131503) [13:57:26] (03PS2) 10Jcrespo: mariadb: Install jessie on db2011 (m2) [puppet] - 10https://gerrit.wikimedia.org/r/319847 [13:57:28] (03CR) 10Ema: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319824 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:57:59] ema: ok to merge your change along? [13:58:20] yep :) [13:59:10] FYI, image scaler load is starting to visibly scale back downward over the last half hour or so…. finally. [14:00:29] (03PS1) 10Alexandros Kosiaris: package_builder: Add dh-golang in the list of packages [puppet] - 10https://gerrit.wikimedia.org/r/319850 [14:00:34] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3046710 keys, up 4 days 5 hours - replication_delay is 0 [14:00:37] Revent: what is this related to? [14:01:15] image scaler issues? [14:01:36] we're also trying to debug, separately, a case of an odd hanging request on the upload caches. if there's something wrong with scalers... 
[14:01:45] bblack: Umm… well, I kinda discovered that if you throw the broken transcodes of a half-dozen or so hour-long 1080p videos back in the queue, the servers try to do them all at the same time. [14:02:04] (03PS3) 10Jcrespo: mariadb: Install jessie on db2011 (m2) [puppet] - 10https://gerrit.wikimedia.org/r/319847 [14:02:21] ... [14:02:36] bblack: It’s not a ‘scaler problem’, other than probably a good idea to tweak the max number of transcodes to try to do at the same time down a bit. :P [14:03:03] well I've got a thumbnail request I'm trying to debug, which is timing out [14:03:14] it seems unlikely these two things are unrelated :) [14:03:16] bblack, there is some backlog with joe, probably not related [14:03:40] I kinda assumed that the ‘queued transcodes’ list actually operated as, you know, a ‘queue’ that got served as load allowed. [14:03:48] (03CR) 10Jcrespo: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319847 (owner: 10Jcrespo) [14:04:01] you assume a lot! :) [14:04:05] he he [14:05:45] bblack: What image are you trying to get re-thumbed? New uploads (the typical small stuff) have been going through the queue even while it’s been loaded down, just slower. [14:06:11] I'm not trying to get anything re-thumbed, and I don't even know that it's a legit original [14:06:25] https://commons.wikimedia.org/w/thumb_handler.php/6/63/Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg/720px-Taissa-Farmiga--2014-Primetime-Emmy-Awards--06_%281%29.jpg [14:06:36] ^ requests from this are found in logs, and they're claiming there's no original [14:06:50] but in some cases, it's just hanging, too [14:06:50] Ah... [14:07:51] 12:20, 2016 April 25 Storkk (A) (talk | contribs | block) deleted page File:Taissa-Farmiga--2014-Primetime-Emmy-Awards--06 (1).jpg (Copyright violation; see COM:L - Using VisualFileChange.) 
(view/restore) (global usage; delinker log) [14:08:53] ok [14:09:01] I've still got the hang to sort out, but now the hang is gone [14:09:24] godog: incidentally, apparently swift sets "Cache-Control: no-cache" on 404s from renderers? [14:10:01] bblack: However, the redlinked entries at the bottom of https://commons.wikimedia.org/wiki/Special:TimedMediaHandler [14:10:21] (03PS2) 10Alexandros Kosiaris: package_builder: Add dh-golang in the list of packages [puppet] - 10https://gerrit.wikimedia.org/r/319850 [14:10:24] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319850 (owner: 10Alexandros Kosiaris) [14:10:49] You’d think that queued thumb/transcode of deleted stuff would go away... [14:10:59] godog: (seems like a bad idea. varnish does limit 4xx to 10 minutes regardless, but some caching of the 404 would be nice) [14:14:26] !log upgrading cp4008 (text-ulsfo) to varnish 4 -- T131503 [14:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:32] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [14:15:15] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2771646 (10debt) p:05Triage>03High @Slaporte - this is interesting and might affect what the Foundation decides... > Some discussions about the equivalent OSM usa... [14:24:03] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2239664 (10GWicke) > An obvious idea that comes to mind is to lower the TCP socket connection time-out, but AFAIK we can change that only system-wide which could ha... 
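The problem discussed above, where re-queued transcodes all start at once and _joe_ suggests lowering the number of simultaneous transcodes, amounts to draining a queue through a bounded worker pool. A hedged sketch with invented job names and an illustrative limit of 2; this is not how TimedMediaHandler's job queue is actually configured:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Illustrative cap on simultaneous transcodes: a batch of re-queued jobs
# drains through a bounded pool instead of all starting at once.
MAX_CONCURRENT_TRANSCODES = 2  # invented limit for illustration

peak = 0     # highest number of jobs observed running at once
running = 0
lock = threading.Lock()

def transcode(title):
    global peak, running
    with lock:
        running += 1
        peak = max(peak, running)
    # ... a real transcoder would invoke ffmpeg here ...
    with lock:
        running -= 1
    return title

jobs = [f"video-{i}.webm" for i in range(10)]
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSCODES) as pool:
    results = list(pool.map(transcode, jobs))

print(peak <= MAX_CONCURRENT_TRANSCODES)  # True: never exceeds the cap
```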
[14:38:02] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2771714 (10Andrew) a:03Andrew [14:38:57] 06Operations, 10ops-codfw, 10DBA: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2771723 (10Marostegui) [14:41:36] (03PS1) 10Marostegui: mariadb: Install jessie on dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/319859 (https://phabricator.wikimedia.org/T150017) [14:45:51] /msg nickserv set enforce ON [14:46:05] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:37] (03CR) 10Andrew Bogott: [C: 031] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319683 (owner: 10BryanDavis) [14:49:42] !log upgrading cp4009 (text-ulsfo) to varnish 4 -- T131503 [14:49:44] (03CR) 10Jcrespo: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319859 (https://phabricator.wikimedia.org/T150017) (owner: 10Marostegui) [14:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:49] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [14:50:15] (03CR) 10Marostegui: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319859 (https://phabricator.wikimedia.org/T150017) (owner: 10Marostegui) [14:53:05] PROBLEM - Varnishkafka log producer on cp4009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:53:23] that's me^ [14:53:56] (03PS2) 10Andrew Bogott: labsprojectfrommetadata: use jq instead of trying to parse JSON with regex [puppet] - 10https://gerrit.wikimedia.org/r/307660 (owner: 10Alex Monk) [14:54:05] RECOVERY - Varnishkafka log producer on cp4009 is OK: PROCS OK: 3 processes with command name varnishkafka [14:55:37] (03CR) 10Andrew Bogott: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/307660 (owner: 10Alex Monk) [14:56:25] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch 
fail. Either compilation failed or puppetmaster has issues [14:58:32] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319598 (https://phabricator.wikimedia.org/T149638) (owner: 10Reedy) [14:59:08] (03Merged) 10jenkins-bot: Elevate password policies for all users on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319598 (https://phabricator.wikimedia.org/T149638) (owner: 10Reedy) [14:59:28] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319000 (https://phabricator.wikimedia.org/T121186) (owner: 10Reedy) [15:00:02] (03Merged) 10jenkins-bot: Increase password requirements on enwiki for "Abuse filter editors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319000 (https://phabricator.wikimedia.org/T121186) (owner: 10Reedy) [15:00:21] (03PS2) 10Reedy: Update minimum bot password length to 8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318951 (https://phabricator.wikimedia.org/T104145) [15:00:25] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318951 (https://phabricator.wikimedia.org/T104145) (owner: 10Reedy) [15:01:09] (03Merged) 10jenkins-bot: Update minimum bot password length to 8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318951 (https://phabricator.wikimedia.org/T104145) (owner: 10Reedy) [15:02:14] (03CR) 10Hashar: recheck [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/316305 (https://phabricator.wikimedia.org/T148363) (owner: 10Gilles) [15:02:29] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: stage wmgElevateDefaultPasswordPolicy (duration: 00m 48s) [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:40] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:04:24] (03PS4) 10Reedy: Enable OATHAuth on all private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319035 (https://phabricator.wikimedia.org/T149614) 
(owner: 10Arseny1992) [15:04:28] 06Operations, 10ops-codfw: Broken disk in labstore2001 - https://phabricator.wikimedia.org/T149567#2771808 (10Papaul) 05Open>03Resolved Disk has been replaced [15:04:34] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Raise password requirements for private wikis, Abuse filter editors on enwiki, and make minimum bot password length to 8 (duration: 00m 47s) [15:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:52] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319035 (https://phabricator.wikimedia.org/T149614) (owner: 10Arseny1992) [15:05:33] (03Merged) 10jenkins-bot: Enable OATHAuth on all private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319035 (https://phabricator.wikimedia.org/T149614) (owner: 10Arseny1992) [15:06:40] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable OATHAuth on all private wikis (duration: 00m 49s) [15:06:42] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2771816 (10Papaul) I think we need to have labstore2001 back up running on H800 controller a for now and work on making the 3ware controller... 
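The change merged earlier ("labsprojectfrommetadata: use jq instead of trying to parse JSON with regex", gerrit 307660) illustrates a general point: regex extraction breaks where a real JSON parser (jq on the command line, the json module here) does not. The metadata shape below is invented for illustration:

```python
import json
import re

# A value containing an escaped quote defeats the naive regex but not
# the parser. The JSON document here is hypothetical sample data.
metadata = '{"meta": {"project": "tools", "note": "a \\"quoted\\" value"}}'

# Fragile: the character class stops at the first double quote it sees.
m = re.search(r'"note":\s*"([^"]*)"', metadata)
print(m.group(1))  # truncated at the escaped quote: wrong

# Robust: let a real JSON parser do the work (jq would behave the same).
doc = json.loads(metadata)
print(doc["meta"]["note"])
```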
[15:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:24] (03PS2) 10Reedy: Move and simplify some wikitech specific config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319600 [15:07:28] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319600 (owner: 10Reedy) [15:08:13] (03Merged) 10jenkins-bot: Move and simplify some wikitech specific config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319600 (owner: 10Reedy) [15:08:30] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319602 (owner: 10Reedy) [15:10:35] Oh, gak… the ‘newer’ entries showing up the the broken transcodes list on Commons are ‘also’ gigabyte-size 1080p files uploaded on October 5th. [15:11:38] Well.. and a bunch of truly broken ‘open access’ videos… [15:11:49] (03PS2) 10Reedy: Load OATHAuth on wikitech same as other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319601 [15:12:03] (03PS1) 10BBlack: VCL: promote general hit-for-pass to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/319869 [15:13:01] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319601 (owner: 10Reedy) [15:13:07] (03PS2) 10Reedy: Remove commented OpenID config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319602 [15:13:12] (03CR) 10Reedy: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319602 (owner: 10Reedy) [15:13:15] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319602 (owner: 10Reedy) [15:13:37] (03Merged) 10jenkins-bot: Load OATHAuth on wikitech same as other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319601 (owner: 10Reedy) [15:13:44] (03Merged) 10jenkins-bot: Remove commented OpenID config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319602 (owner: 10Reedy) [15:15:09] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Normalise wikitech OATHAuth loading config (duration: 00m 48s) 
[15:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] !log reedy@tin Synchronized wmf-config/wikitech.php: Stop double loading OATHAuth now, remove commented config (duration: 00m 47s) [15:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:28] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Simplify some wikitech config (duration: 00m 47s) [15:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:55] (03PS2) 10Reedy: wikitech: remove 'bots' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319683 (owner: 10BryanDavis) [15:18:00] (03CR) 10Reedy: [C: 032] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319683 (owner: 10BryanDavis) [15:18:35] (03Merged) 10jenkins-bot: wikitech: remove 'bots' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319683 (owner: 10BryanDavis) [15:19:07] (03PS4) 10Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [15:19:14] papaul: Hi! are you the DC today, and if so would you like to continue connecting the external shelves to the H800 on 2001? [15:19:38] (03PS5) 10Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 [15:19:52] (we can do it next week too) [15:20:22] (03PS2) 10BBlack: VCL: promote general hit-for-pass to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/319869 [15:20:40] madhuvishy: done already [15:20:49] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Remove wikitech bot group (duration: 00m 47s) [15:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:08] papaul: aah! How many disks are there among the shelves, and are all of them okay? 
or should i check [15:21:24] 06Operations, 07Puppet, 06Discovery, 06Maps: Refactor puppet-postgresql module to use custom types - https://phabricator.wikimedia.org/T150020#2771841 (10Gehel) [15:21:44] madhuvishy: 48 on H800 and 12 on h700 [15:23:04] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: WDQS monitoring of response times needs to be adapted now that we use LVS - https://phabricator.wikimedia.org/T148015#2771854 (10Gehel) 05Open>03Resolved This is resolved by https://gerrit.wikimedia.org/r/315651 [15:23:43] (03Abandoned) 10Muehlenhoff: Add versioned dependency on kernel to ensure that latest version is pulled in [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/316781 (owner: 10Muehlenhoff) [15:24:30] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:25:12] (03PS3) 10Ema: VCL: promote general hit-for-pass to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/319869 (owner: 10BBlack) [15:25:18] (03PS1) 10Muehlenhoff: Depend on new ABI name [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/319870 [15:25:53] (03CR) 10Ema: [C: 031] [puppet] - 10https://gerrit.wikimedia.org/r/319869 (owner: 10BBlack) [15:26:51] 06Operations, 10Traffic, 10media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2771888 (10BBlack) [15:27:26] papaul: nice i see all of them on megacli. these 48 disks are split between 4 shelves 12 each? [15:28:26] madhuvishy: yes [15:29:04] (03CR) 10BBlack: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319869 (owner: 10BBlack) [15:29:45] papaul: could you tell me which of these arrays have the shelves? 
https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1433 (just for my learning) [15:31:05] madhuvishy: sure [15:31:39] madhuvishy: labstore-arrayx-codfw [15:32:37] papaul: I see 4 arrays, and wasn't sure how many shelves each of them had, and which labstore boxes were connected to what [15:32:37] madhuvishy: with x=0,1,2 and 3 [15:33:42] madhuvishy: labstore2001 is connected to array0,1,2 and [15:34:02] madhuvishy: it will be the same if we have labstore2002 up as well [15:34:23] madhuvishy: but for now labstore2002 is connected to any shelves [15:34:37] madhuvishy: labsstore2002 is not connected [15:34:46] (03PS1) 10Hashar: contint: email template for Jenkins notification [puppet] - 10https://gerrit.wikimedia.org/r/319874 (https://phabricator.wikimedia.org/T149996) [15:35:40] papaul: right, okay. 2003 and 4 aren't connected to any external shelves either [15:35:47] !log reimage dbstore2002 - T150017 [15:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:53] T150017: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017 [15:36:04] (03PS1) 10Muehlenhoff: elasticsearch::https: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/319875 [15:36:06] !log set up 4x10G (ae0) links between asw-d-eqiad<->asw2-d-eqiad [15:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:12] madhuvishy: correct [15:36:26] papaul: so apart from 4 shelves available for labstore2001 with 48 disks altogether, we also have 4 more shelves for 2002 with 48 more disks? 
[15:37:16] !log upgrading cp4010 (text-ulsfo) to varnish 4 -- T131503 [15:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:21] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [15:37:24] madhuvishy: no labstore2001 and labstore2003 shared the same shelves [15:37:48] sorry labstore2001 and 2002 shared the same shelves [15:37:48] papaul: aah, is that normal? [15:38:00] madhuvishy: ^ [15:38:03] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [15:38:37] papaul: right 2001 and 2002. How are they connected to share shelves? [15:38:43] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:38:53] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [15:39:01] 2002 is not connect right now to anything [15:39:08] (03PS1) 10Muehlenhoff: carbon_pickled: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/319878 [15:39:13] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2771723 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2002.codfw.wmnet'] ``` The log can be found in `/var/log/w... [15:39:24] papaul: so like when failing over manually plug the raid card of 2002 into the shelves? [15:39:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3047732 keys, up 4 days 7 hours - replication_delay is 0 [15:40:03] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/lib/jenkins/email-templates] [15:40:07] madhuvishy: it supposed to work like that [15:40:29] madhuvishy: but right now 2002 is not up [15:40:36] (03CR) 10Hashar: Puppet compile https://puppet-compiler.wmflabs.org/4548/contint1001.wikimedia.org/ [puppet] - 10https://gerrit.wikimedia.org/r/319874 (https://phabricator.wikimedia.org/T149996) (owner: 10Hashar) [15:40:53] puppet is going to fail on contint1001 [15:41:17] papaul: understood [15:41:23] madhuvishy: cool [15:41:55] to fix contint1001 puppet run I could use a merge https://gerrit.wikimedia.org/r/#/c/319874/1 please. It pass compiler just fine :] [15:42:27] madhuvishy: I think i did a diagram of this back i will look for it and share it with you [15:42:42] papaul: that would be great thank you :D [15:42:51] madhuvishy: no problem [15:42:54] <_joe_> hashar: the file you reference to, is it already in puppet? [15:43:05] oh man [15:43:08] * hashar runs git add [15:43:14] <_joe_> ;) [15:43:23] <_joe_> I'm not sure that is a great solution either [15:43:29] (03PS2) 10Hashar: contint: email template for Jenkins notification [puppet] - 10https://gerrit.wikimedia.org/r/319874 (https://phabricator.wikimedia.org/T149996) [15:43:33] <_joe_> what do you use that template for? [15:44:20] generate emails based on a Jenkins build context [15:44:33] <_joe_> and why it's not on the master? [15:44:50] "It comes from integration/jenkins which is not available on the Jenkins master." [15:45:03] <_joe_> I mean won't it be better to deploy the actual integration/jenkins thing to the master? [15:45:08] (03CR) 10Hashar: [C: 031] Forgot to git add in PS 1 :( [puppet] - 10https://gerrit.wikimedia.org/r/319874 (https://phabricator.wikimedia.org/T149996) (owner: 10Hashar) [15:45:17] na it is not needed anymore [15:45:18] <_joe_> what is 'integration/jenkins' btw? a repo? [15:45:21] <_joe_> a package? 
[15:45:32] the reason integration/jenkins.git was still on gallium is because it hasn't been garbage collected [15:45:50] <_joe_> yeah but why duplicate content when you can just git::clone it? [15:45:51] it's a collection of scripts/tools we use on all slaves [15:46:00] <_joe_> I would clone that instead [15:46:30] I am not duplicating content [15:46:34] i dropped the template for the all repo [15:46:41] i dropped the template for the **other** repo [15:46:56] <_joe_> oh ok [15:47:04] <_joe_> then that's ok :) [15:47:14] the file hasn't been edited in a couple years so it is fine to have it in puppet.git :D [15:47:35] (03CR) 10Giuseppe Lavagetto: [C: 032] [puppet] - 10https://gerrit.wikimedia.org/r/319874 (https://phabricator.wikimedia.org/T149996) (owner: 10Hashar) [15:47:43] \o/ [15:47:47] <_joe_> I just don't want it in two different repos [15:47:50] lot of context around sorry :( [15:47:55] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2772002 (10faidon) [15:48:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [15:48:21] the removal from the other repo is https://gerrit.wikimedia.org/r/#/c/319873/ [15:49:13] puppet fixed! thank you _joe_ :) [15:49:55] <_joe_> hashar: yw :) [15:50:03] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:50:28] _joe_: oh and the migration to contint1001 went just fine! with no puppet magic involved [15:50:48] the refactoring of zuul to use hiera was definitely a great thing. Thanks for the past reviews [15:50:53] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:50:57] <_joe_> np "_ [16:00:59] !log upgrading cp4016 (text-ulsfo) to varnish 4 -- T131503 [16:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:06] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [16:01:54] papaul: when I run fdisk -l | grep Disk on labstore2001, I only see 12 1.8TB disks. [16:02:30] i'm not sure if I've to do something on my end to see the rest [16:03:47] PROBLEM - Varnishkafka log producer on cp4016 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [16:04:47] RECOVERY - Varnishkafka log producer on cp4016 is OK: PROCS OK: 3 processes with command name varnishkafka [16:07:18] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772062 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbstore2002.codfw.wmnet'] ``` and were **ALL** successful. [16:13:14] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772079 (10Marostegui) ``` root@dbstore2002:/srv/sqldata# lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 8.6 (jessie) Release: 8... [16:17:57] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:21:57] madhuvishy: you only see the once connected to H700 hmm.... 
ok when i am done at the clinic i will take a look in the RAID manager to see [16:22:46] !log upgrading cp4017 (text-ulsfo) to varnish 4 -- T131503 [16:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:52] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [16:25:47] PROBLEM - Varnishkafka log producer on cp4017 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [16:26:47] RECOVERY - Varnishkafka log producer on cp4017 is OK: PROCS OK: 3 processes with command name varnishkafka [16:27:20] 06Operations, 10Traffic, 10media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772092 (10fgiunchedi) AFAICS it is `thumb_handler.php` from MW generating CC: no-cache on 404s and then proxied back to varnish by swift's `rewrite.py`, e.g. on m... [16:30:18] papaul: ah yeah alright [16:38:12] 06Operations, 10Traffic, 10media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772108 (10BBlack) When I hit a renderer directly, I get: ``` bblack@cp1099:~$ curl "http://rendering.svc.eqiad.wmnet/wikipedia/commons/thumb/6/63/Taissa-Farmiga--... 
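madhuvishy's check above (`fdisk -l | grep Disk` showing only the 12 disks on the H700) boils down to counting the `Disk /dev/...` header lines fdisk prints per device. An illustrative approximation with invented sample output; on labstore2001 you would feed in the real `fdisk -l` text instead:

```python
import re

# Hypothetical fdisk -l output; real output has one such header per disk.
FDISK_OUTPUT = """\
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk /dev/sdc: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
"""

def visible_disks(fdisk_text):
    """Return the device names fdisk reports, one per 'Disk /dev/...' line."""
    return re.findall(r"^Disk (/dev/\S+):", fdisk_text, re.MULTILINE)

disks = visible_disks(FDISK_OUTPUT)
print(len(disks), disks)  # 3 ['/dev/sda', '/dev/sdb', '/dev/sdc']
```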
[16:38:30] !log upgrading cp4018 (text-ulsfo) to varnish 4 -- T131503 [16:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:36] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [16:41:30] PROBLEM - Varnishkafka log producer on cp4018 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [16:42:30] RECOVERY - Varnishkafka log producer on cp4018 is OK: PROCS OK: 3 processes with command name varnishkafka [16:42:33] 06Operations, 06Security-Team: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2772112 (10Reedy) [16:43:24] 06Operations, 06Security-Team: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2772125 (10Reedy) [16:44:12] madhuvishy: just to be on the safe side do we have data on those shelves? [16:44:35] madhuvishy: that we need to keep? [16:44:38] papaul: we do, but it's fine if they were to be destroyed [16:44:50] madhuvishy: if i have to rebuild the RAID? [16:44:55] papaul: that's fine [16:45:31] (03CR) 10Filippo Giunchedi: [C: 032] carbon_pickled: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/319878 (owner: 10Muehlenhoff) [16:45:55] madhuvishy: cool do you know how to create Virtual disks? [16:46:43] papaul: I've never done it [16:47:18] i'm happy to do it if you tell me how :) [16:47:53] 06Operations, 10Traffic, 10media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772143 (10BBlack) Ignore the above comment, the URL is wrong. What @fgiunchedi pasted is right, you just have to connect to `rendering.svc.eqiad.wmnet` while usi... 
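The caching question in T150022 above: with `Cache-Control: no-cache` on the 404 the response is treated as uncacheable, whereas without it, bblack notes earlier, Varnish would still cap 4xx lifetimes at 10 minutes. A sketch of that TTL decision; this is illustrative, not Wikimedia's actual VCL logic:

```python
# Illustrative model of the behaviour under discussion: no-cache forces
# every miss to re-render, while an uncapped 404 would still be limited
# to a 10-minute lifetime for 4xx responses.
FOURXX_TTL_CAP = 600  # seconds; "varnish does limit 4xx to 10 minutes"

def effective_ttl(status: int, cache_control: str, default_ttl: int = 3600) -> int:
    """Return how long (in seconds) a response would stay cached."""
    if "no-cache" in cache_control.lower():
        return 0
    if 400 <= status < 500:
        return min(default_ttl, FOURXX_TTL_CAP)
    return default_ttl

print(effective_ttl(404, "no-cache"))  # 0: every miss hits the renderer
print(effective_ttl(404, ""))          # 600: repeated 404s absorbed
```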
[16:49:44] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772146 (10GWicke) > In the current examples, I think it's unfortunate that height-constraining isn't considered. Not as a feature that would be availa... [16:49:48] madhuvishy: do you want for us to do it together? [16:49:54] papaul: sure [16:51:00] madhuvishy: ok will ping you when i am done at clinic [16:52:18] papaul: aah, i'm going to be afk for a bit from 11 to 1.30 or so [16:52:32] (PST) [16:53:46] madhuvishy: ok [16:55:40] (03PS1) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [16:56:08] (03PS1) 10Gehel: Maps - tilerator on all maps servers needs access to postgresql master [puppet] - 10https://gerrit.wikimedia.org/r/319893 (https://phabricator.wikimedia.org/T147223) [16:56:44] (03CR) 10Reedy: "This may be better with logging to file, and the --verbose flag too" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [16:56:58] (03CR) 10jenkins-bot: [V: 04-1] Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [16:58:01] (03PS2) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [17:00:59] (03CR) 10Chad: [C: 04-1] Add cronjob for regenerating captchas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [17:01:48] (03CR) 10Reedy: "I don't disagree ;)" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [17:02:51] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - 
https://phabricator.wikimedia.org/T66214#2772169 (10GWicke) [17:05:21] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 [17:05:31] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [17:07:02] (03CR) 10Chad: "No time like the present to fix it, while you're here :)" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [17:07:45] (03CR) 10Reedy: "Question would be for maybe aaron... How often do these word lists change? Is it going to be a pain to have it somewhere like private pupp" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [17:09:47] (03CR) 10Chad: "Eh, putting the badword list public allows someone to limit their dictionary if they're trying to build a list of known words, best to kee" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [17:10:52] 06Operations, 10Traffic, 10media-storage: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772190 (10BBlack) [17:11:08] 06Operations, 10Traffic, 10media-storage: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2771888 (10BBlack) Title/desc fixed up to not implicate Swift :) [17:13:38] AaronSchulz: Any reason we can't just put your captcha wordlists into the private puppet? :) [17:13:44] How often do you regenerate them? I'm presuming rarely [17:14:18] <_joe_> there is no way we're gonna merge something that refers to a user's homedir [17:14:23] _joe_: I know [17:14:43] I'm just checking there's no obvious reason why they may need to reside there [17:14:54] _joe_: awwwwww, come on! institutionalized spaghetti code! 
[17:15:05] <_joe_> greg-g: I am imagining what follows [17:15:12] :) :) [17:15:18] ain't no party like a captcha party [17:15:26] <_joe_> in 20 years aaron moves on, and magically his home dir gets wiped upon offboarding [17:15:33] _joe_: don't be daft [17:15:33] <_joe_> aand magically that breaks captcha [17:15:37] we don't offboard [17:16:06] cause a captcha party don't stop? [17:16:16] <_joe_> and some poor ops guy will have to restore the lists from a backup, cursing all of us who merged this. I have enough bad karma as it is :P [17:16:31] heh [17:17:22] what backup... [17:17:22] so: no matter what AaronSchulz says, it's going in private puppet. {{done}} [17:17:26] apergos: zing [17:17:36] good [17:18:10] -rw-r--r-- 1 aaron wikidev 7206 Sep 25 2014 /home/aaron/badwords [17:18:10] -rw-r--r-- 1 aaron wikidev 938848 Sep 25 2014 /home/aaron/words [17:18:18] I'm guessing they don't get updated very often ;) [17:18:18] (03CR) 10Filippo Giunchedi: "recheck" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/316305 (https://phabricator.wikimedia.org/T148363) (owner: 10Gilles) [17:18:54] do we care if the bad words are public? [17:18:59] is what i was wondering [17:19:07] ostriches suggested so [17:19:09] * greg-g shrugs [17:19:11] Which I can understand [17:19:20] It's helping someone limit their search criteria [17:19:31] even if it doesn't help much... still [17:19:39] mmm [17:19:47] better OCR will no doubt have more of a chance [17:20:02] (03PS2) 10Filippo Giunchedi: Prevent Thumbor from creating files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/319802 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [17:20:04] Keeps it simple too, since wordlist and badwords will be side by side.
[17:20:17] _joe_: And yes, moving it out of /home (broadly) is the reason for my -1 [17:20:31] (nothing against Aaron :)) [17:20:35] well, let me know if you need something added to private repo [17:21:05] probably want some better filenames too [17:21:11] and a txt extension? or don't we care? [17:21:35] Eh, something like /etc/fancycaptcha/wordlist and /etc/fancycaptcha/badwords would be self-explanatory to me [17:22:32] ostriches: I guess, that counts as a better filename too [17:22:51] (03CR) 10Filippo Giunchedi: [C: 032] Prevent Thumbor from creating files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/319802 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [17:24:00] mutante: If you want to jfdi and add it to the private repo, that WFM... Just need to update the patch to place the files I guess, and obviously use them [17:25:08] We'll need some dummy files for beta too, or puppet will break [17:25:18] Eh, assuming the cron runs there [17:25:45] honestly no idea if crons run there [17:25:46] yea, but not a problem, just add file with "fuckthisshit" in labs/private [17:26:02] Do we use captchas on labs? :P [17:26:05] /beta [17:28:14] 99171 "words" 877 "badwords" [17:33:25] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:42] (03CR) 10Filippo Giunchedi: "Mhhh I've applied this on thumbor1001 but doesn't seem to have had an effect on the processes themselves as reported by procfs:" [puppet] - 10https://gerrit.wikimedia.org/r/319802 (https://phabricator.wikimedia.org/T145878) (owner: 10Gilles) [17:35:32] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2772370 (10RobH) a:03mark I'd like to allocate spare pool system WMF4726 for this request. It has the following specs: * Dual Intel® Xeon® Processor E5- 2623 V3 (3.0GHz/4C...
[17:39:28] 06Operations, 10Traffic, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2772387 (10BBlack) The AES flip doesn't seem to have had any notable influence on performance metrics, either. Now after all the above merges, our server-side ciph... [17:41:37] 06Operations, 10Traffic, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2772402 (10BBlack) Another notable effect was that there's apparently a small subset of clients out there in the world (somewhere in the 0.1% -> 1% ballpark) who im... [17:41:41] Reedy: ostriches: added to private repo, you should be able to access them via: [17:42:06] content => secret('fancycaptcha/badwords'); (and just "words") [17:42:35] so you can let puppet install them from there to /srv/org/wikimedia/ or wherever [17:42:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [17:42:59] eh, /etc you said , i guess [17:43:09] It's a file {} thing it needs, right? 
[17:43:37] yes, right [17:44:00] example: modules/mw_rc_irc/manifests/ircserver.pp line 16 ++ [17:44:04] !log moved cr1-eqiad:ae4 links from asw-d-eqiad:ae1 to asw2-d-eqiad:ae1 [17:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:44] (03PS3) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [17:46:49] Reedy: Yeah two file {} stanzas, then make the cron require => [File, File] for both of them [17:46:50] :) [17:47:06] gah, trailing whitespace [17:47:23] Oh, and probably pass that ensure to the files too [17:47:24] So if we move the cron the files won't remain [17:48:28] (03CR) 10Dzahn: Add cronjob for regenerating captchas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [17:48:54] (03PS4) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [17:50:06] RECOVERY - MegaRAID on db1051 is OK: OK: optimal, 1 logical, 2 physical [17:50:17] ostriches: Think it's worth logging the output to file? [17:51:06] Eh, would anyone care?
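The wiring discussed above — two file {} stanzas whose content comes from secret() in the private repo, with the cron requiring both — could look roughly like the following. This is a sketch only: the paths, class name, and the regeneration command are illustrative placeholders, not the actual contents of gerrit change 319892.

```puppet
# Hypothetical sketch of the fancycaptcha wordlist + cron wiring discussed
# above. File paths, the class name, and the regenerate-captchas wrapper
# are placeholders; the real change is in https://gerrit.wikimedia.org/r/319892.
class fancycaptcha::cron {
    file { '/etc/fancycaptcha/words':
        ensure  => present,
        owner   => 'root',
        mode    => '0444',
        content => secret('fancycaptcha/words'),
    }

    file { '/etc/fancycaptcha/badwords':
        ensure  => present,
        owner   => 'root',
        mode    => '0444',
        content => secret('fancycaptcha/badwords'),
    }

    cron { 'regenerate_fancycaptchas':
        ensure  => present,
        user    => 'www-data',
        hour    => 0,
        minute  => 0,
        # placeholder wrapper standing in for the actual maintenance script
        command => '/usr/local/bin/regenerate-captchas >> /var/log/fancycaptcha.log 2>&1',
        require => [File['/etc/fancycaptcha/words'], File['/etc/fancycaptcha/badwords']],
    }
}
```

Passing the same ensure to both file resources and listing them in the cron's require, as suggested in the channel, means removing the class also removes the word lists rather than leaving them behind.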
[17:51:26] Just debugging purposes maybe [17:52:42] Or, after "why have we not got any captchas left" [17:54:25] !log reactivating cr1-eqiad:ae4 and its subinterfaces (VRRP bug seems to have been worked around) [17:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:05] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1066 is CRITICAL: connect to address 10.64.0.103 and port 3128: Connection refused [18:00:05] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.001 second response time [18:01:12] !log moving mc1033-mc1036 from asw-d-eqiad to asw2-d-eqiad [18:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:19] (03PS1) 10BBlack: varnish-backend-restart: sync + wait more [puppet] - 10https://gerrit.wikimedia.org/r/319896 [18:02:25] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:02:53] (03CR) 10BBlack: [C: 032 V: 032] varnish-backend-restart: sync + wait more [puppet] - 10https://gerrit.wikimedia.org/r/319896 (owner: 10BBlack) [18:06:06] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2772466 (10faidon) [18:07:44] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10faidon) asw<->asw2 links are done, 4x10G on racks D 2 and D 7 (2 each). The 4x10G links from cr1-eqiad:ae4 to asw-d-eqiad:ae1 have been moved over to asw2-...
[18:11:43] !log uploaded new jessie linux package based on 4.4.30 to carbon [18:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:16] (03CR) 10Filippo Giunchedi: Initial commit (035 comments) [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [18:13:26] (03PS5) 10Filippo Giunchedi: Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) [18:14:39] (03PS1) 10Ppchelko: RESTBase: Add baseUriTemplate parameter. [puppet] - 10https://gerrit.wikimedia.org/r/319897 [18:14:52] 06Operations, 10Mobile-Content-Service, 06Services, 07Service-deployment-requests: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772489 (10Mholloway) [18:17:09] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772494 (10Pchelolo) [18:17:55] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:05] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:32] [10:18:54] do we care if the bad words are public? <-- please make the list public, we already have a partial badlist in the ConfirmEdit extension [18:19:35] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [18:19:35] RECOVERY - Host mc1035 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [18:21:45] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[tshark],Package[tmux] [18:22:18] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:23:33] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772530 (10Fjalapeno) [18:23:59] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: lutetium RAID disk failed - https://phabricator.wikimedia.org/T149904#2768426 (10Cmjohnson) disk swapped. [18:24:01] (03PS1) 10Yuvipanda: statistics: Don't install spelling libraries on debian [puppet] - 10https://gerrit.wikimedia.org/r/319898 [18:24:19] 06Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T149964#2772533 (10Cmjohnson) disk swapped [18:25:23] 06Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T149964#2772538 (10Cmjohnson) 05Open>03Resolved Raid is optimal again...resolving [18:26:12] (03PS2) 10Yuvipanda: statistics: Don't install spelling libraries on debian [puppet] - 10https://gerrit.wikimedia.org/r/319898 [18:26:20] (03CR) 10Yuvipanda: [C: 032 V: 032] statistics: Don't install spelling libraries on debian [puppet] - 10https://gerrit.wikimedia.org/r/319898 (owner: 10Yuvipanda) [18:26:45] (03CR) 10Mobrovac: [C: 031] RESTBase: Add baseUriTemplate parameter. [puppet] - 10https://gerrit.wikimedia.org/r/319897 (owner: 10Ppchelko) [18:44:00] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758740 (10Dzahn) I agree that we should not have disabled notifications _without_ a comment on them, ideally a reference to a ticket every time. But it's ok to have them if they have a comment AND... 
[18:48:20] RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:48:50] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:55:08] (03CR) 10Reedy: "Soft dependency on Id23483286ae2549bfd6f1377c6a0d0c0898b88c4 and Iaedf0d4903c0fd9a9cca3e648a2a9691f54c6af8 being merged... Needs --oldcapt" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [18:56:53] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:59:55] 06Operations, 06Security-Team, 13Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2772112 (10Reedy) [19:00:11] RECOVERY - check_raid on lutetium is OK: OK: MegaSAS 1 logical, 2 physical [19:00:16] BTW… if any of you ops guys care… I’m continuing to try to push these big old broken transcodes through the video scalers, but I’m watching ganglia and only putting 40-50% load on them… other people have uploaded stuff, and it’s having no trouble being ‘responsive’ to handling normal uploads in just a couple of minutes. [19:01:21] Revent: thanks for the heads-up, it's good to know that in case somebody reports problems [19:02:14] connects to a random videoscaler [19:02:41] yes, looks busy but not in a problematic way to me [19:02:42] mutante: I suspect (given the size of these files, and that they were all about the same age) that someone did a bunch of server-side uploads and nailed it so hard they all timed out. [19:03:14] I felt kinda bad about setting them on fire earlier. :/ [19:03:39] ah, i didn't know something happened earlier [19:03:40] I did not realize it would try to eat them all at the same time. [19:04:03] Yeah, I pegged them at 270% load for about 6 hours….
[19:04:18] (03CR) 10Reedy: "2 other outstanding questions;" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [19:05:08] I promise to not do that again… [19:05:09] well, keep it at this level i guess, they look ok to me in monitoring [19:05:12] thanks [19:05:37] yea, there are not that many of them [19:06:10] I see ones at codfw tho… hopefully sometime? [19:06:59] yes, there are 2 in codfw [19:07:16] They look like much more powerful machines. [19:07:17] at least they are in puppet [19:07:43] At least from what ganglia says [19:08:04] 06Operations, 06Security-Team, 13Patch-For-Review: Create cronjob for regular captcha regeneration - https://phabricator.wikimedia.org/T150029#2772632 (10Reedy) Related to the other tasks.. T125132 and T141490 Is a fill of 10,000 enough? Do we need to delete captchas somehow? If so, how do we do so? htt... [19:08:23] yea, pretty sure they are, just because time has passed since the ones for eqiad were setup [19:09:27] I’m guessing nobody has ever really specifically ‘looked’ at the pile of old broken transcodes on Commons… the ‘count’ is insane. [19:09:46] Hopefully most are small files. [19:10:21] yea, i dunno that, but probably not. would have to ask commons channel [19:10:21] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:12:29] Can we put the codfw machines into rotation for transcoding? Or just a PITA? 
[19:14:23] good question [19:15:08] maybe we should find that out in a ticket [19:17:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:21:01] (03Abandoned) 10Ori.livneh: [DNM] hack maintain-replicas.pl for adywiki/jamwiki [software] - 10https://gerrit.wikimedia.org/r/295564 (https://phabricator.wikimedia.org/T135029) (owner: 10Ori.livneh) [19:21:29] (03Abandoned) 10Ori.livneh: Parametrize supplementary response headers in vcl_config [puppet] - 10https://gerrit.wikimedia.org/r/294171 (owner: 10Ori.livneh) [19:22:48] (03Abandoned) 10Ori.livneh: Add quick-n-dirty logging function hackLog() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277585 (owner: 10Ori.livneh) [19:23:20] (03PS3) 10Ori.livneh: Report save timing by MediaWiki version [puppet] - 10https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557) [19:24:31] 10Blocked-on-Operations, 06Operations, 10DBA, 06Labs, and 2 others: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2772659 (10ksmith) Does this patch being abandoned mean that this issue is no longer fixed? 
[19:25:25] (03CR) 10Ori.livneh: [C: 032] Report save timing by MediaWiki version [puppet] - 10https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557) (owner: 10Ori.livneh) [19:25:51] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:31:52] (03PS1) 10Reedy: Shift 10 more extensions to use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319907 [19:33:44] (03PS2) 10Reedy: Shift 10 more extensions to use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319907 [19:34:11] (03PS2) 10Dzahn: delete snaprotate.pl from files/backup/ [puppet] - 10https://gerrit.wikimedia.org/r/318450 [19:34:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:34:38] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772675 (10Mholloway) [19:35:59] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:36:11] (03PS3) 10Dzahn: delete snaprotate.pl from files/backup/ [puppet] - 10https://gerrit.wikimedia.org/r/318450 [19:36:22] 06Operations, 10Mobile-Content-Service, 07Service-deployment-requests, 06Services (watching): New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772692 (10Fjalapeno) [19:38:05] (03PS4) 10Dzahn: delete snaprotate.pl from files/backup/ [puppet] - 10https://gerrit.wikimedia.org/r/318450 [19:38:24] (03CR) 10Dzahn: [C: 032] "https://wikitech.wikimedia.org/w/index.php?title=Database_snapshots&action=history" [puppet] - 10https://gerrit.wikimedia.org/r/318450 (owner: 10Dzahn) [19:38:55] (03PS5) 10Dzahn: delete snaprotate.pl from files/backup/ [puppet] - 10https://gerrit.wikimedia.org/r/318450 [19:38:57] (03Abandoned) 10Ori.livneh: varnish: report response age to StatsD [puppet] - 10https://gerrit.wikimedia.org/r/269086 (owner: 10Ori.livneh) [19:39:19] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:43:26] (03CR) 10Dzahn: ""awards a token" - thanks for this !:)" [puppet] - 10https://gerrit.wikimedia.org/r/319544 (owner: 10Alexandros Kosiaris) [19:50:06] (03PS3) 10Reedy: Shift 10 more extensions to use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319907 (https://phabricator.wikimedia.org/T140852) [19:54:19] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [19:55:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3046034 keys, up 4 days 11 hours - replication_delay is 34 [20:00:18] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:15] (03PS1) 10Andrew Bogott: Add wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/319909 [20:03:58] RECOVERY - puppet last run on labvirt1014 
is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:04:15] (03PS2) 10Andrew Bogott: Add wmfkeystonehooks [puppet] - 10https://gerrit.wikimedia.org/r/319909 [20:09:36] 10Blocked-on-Operations, 06Operations, 10DBA, 06Labs, and 2 others: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2286195 (10AlexMonk-WMF) No, the proper version of the script was replaced in T138450, Ori's DNM version of the... [20:10:18] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:13:58] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:27:44] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772805 (10Marostegui) dbstore2002 is now running 10.0.28, mysql_upgrade went fine and tokuDB engine is loaded. The slaves are catching up too. ``` root@dbstore2002:/opt/wmf-mariadb10/bin#... 
[20:31:37] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2772815 (10Marostegui) [20:31:40] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Reimage dbstore2002 - https://phabricator.wikimedia.org/T150017#2772814 (10Marostegui) 05Open>03Resolved [20:38:22] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:42:01] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:56:54] PROBLEM - Disk space on einsteinium is CRITICAL: DISK CRITICAL - free space: / 1670 MB (3% inode=97%) [20:57:11] heh I noticed the warning, already looking [20:57:15] spoiler alert: it is the logs [20:58:13] (03PS5) 10Ori.livneh: Add an Icinga check for Graphite metric freshness [puppet] - 10https://gerrit.wikimedia.org/r/251675 [21:00:13] (03CR) 10jenkins-bot: [V: 04-1] Add an Icinga check for Graphite metric freshness [puppet] - 10https://gerrit.wikimedia.org/r/251675 (owner: 10Ori.livneh) [21:01:34] (03PS6) 10Ori.livneh: Add an Icinga check for Graphite metric freshness [puppet] - 10https://gerrit.wikimedia.org/r/251675 [21:02:58] 06Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2772912 (10EBernhardson) [21:04:13] 06Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2772912 (10EBernhardson) This was fixed for the moment by running `update-ca-certificates --fresh` on all instances in the beta cluster, but should pr... 
[21:06:54] RECOVERY - Disk space on einsteinium is OK: DISK OK [21:06:56] !log compress huge daemon.log on einsteinium into /srv/ [21:06:57] (03PS1) 10Ori.livneh: Fix-up for Ia07b03f12b [puppet] - 10https://gerrit.wikimedia.org/r/319916 [21:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:12] neat, icinga is spot on nowadays [21:07:27] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772934 (10Tgr) >>! In T66214#2767815, @Gilles wrote: > These issues have been decoupled. Thumbor is currently set to be a drop-in replacement for imag... [21:07:44] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:03] (03CR) 10jenkins-bot: [V: 04-1] Fix-up for Ia07b03f12b [puppet] - 10https://gerrit.wikimedia.org/r/319916 (owner: 10Ori.livneh) [21:10:11] (03PS2) 10Ori.livneh: Fix-up for Ia07b03f12b [puppet] - 10https://gerrit.wikimedia.org/r/319916 [21:13:48] (03CR) 10Ori.livneh: [C: 032] Fix-up for Ia07b03f12b [puppet] - 10https://gerrit.wikimedia.org/r/319916 (owner: 10Ori.livneh) [21:14:03] !log T133395: Starting user-defined compaction of local_group_wikipedia_T_parsoid_html.data, files la-169018-big-Data.db and la-171488-big-Data.db [21:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:09] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [21:18:41] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772979 (10GWicke) [21:18:57] 06Operations, 10Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2772982 (10fgiunchedi) [21:22:51] (03CR) 10Dzahn: [C: 031] "puppet part looks good to me" [puppet] - 
10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [21:31:07] (03PS1) 10Dzahn: add fake word lists for fancycaptcha in beta [labs/private] - 10https://gerrit.wikimedia.org/r/319924 [21:34:25] (03CR) 10ArielGlenn: [C: 031] "without a second thought." [labs/private] - 10https://gerrit.wikimedia.org/r/319924 (owner: 10Dzahn) [21:34:51] (03CR) 10Alex Monk: [C: 031] add fake word lists for fancycaptcha in beta [labs/private] - 10https://gerrit.wikimedia.org/r/319924 (owner: 10Dzahn) [21:36:46] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:37:25] Reedy: Heya… must say, btw. [21:38:32] The big backlog of ‘broken’ transcodes on Commons is obviously not a ‘big deal’… obviously, it’s never broken anything until now, but it would be quite nice to knock it down. [21:39:04] (03CR) 10Dzahn: [C: 032 V: 032] add fake word lists for fancycaptcha in beta [labs/private] - 10https://gerrit.wikimedia.org/r/319924 (owner: 10Dzahn) [21:40:03] If ‘all CPUs are equal’, getting those two machines that are apparently doing nothing would throw well over twice the horsepower at it even ‘if’ twice as much was left for incoming uploads. [21:41:15] (assuming that ganglia is not lying) So if that was possible, again, it’s not like it’s an ‘urgent issue’ or anything, but it would make a difference... [21:41:47] Really only make a dent over the long term, tho… [21:44:26] It’s hard to interpret the numbers at https://commons.wikimedia.org/wiki/Special:TimedMediaHandler in any meaningful way without thinking something like half to a third of the transcodes on Commons are broken, and I just can’t see that being right. [21:46:32] Wow, bad spelling, lol. [21:49:01] Revent, can the software recognise a broken transcode?
[21:49:30] Krenair: See https://commons.wikimedia.org/wiki/File:012015_SOTU_NoGFX_HD.webm [21:49:56] It also lists 50 or so at https://commons.wikimedia.org/wiki/Special:TimedMediaHandler [21:50:15] But… it also indicates we have over a third of a damn million. [21:51:09] I have no impression at all, from actual experience of seeing ‘errored’ transcodes, how accurate that is. [21:51:34] But… I don’t commonly ‘look at’ that info when doing stuff. [21:52:00] brion: ^ something you might have some interest in [21:52:22] I have seen some ‘not working’ transcodes that did not indicate an error, but they were rather obviously due to malformed source files. [21:52:45] i.e. the transcode was mangled noise. [21:52:51] Sec... [21:53:06] This one https://commons.wikimedia.org/wiki/File:Inclusion-flotation-driven-channel-segregation-in-solidifying-steels-ncomms6572-s4.ogv [21:53:51] The ‘successful’ transcodes were after me messing with it, and if you look closely you’ll see they are not… quite right. [21:54:14] But that seems likely to be a broken source file. [21:55:51] The thumbed version of the original upload, before I ‘kinda’ fixed it… the ‘full res’ worked okay, but all the supposedly ‘successful’ transcodes were just that same orange noise. [21:57:32] There are some others… someone else deleted them, but the ‘original source’ was AVI files, uploaded by ‘open access bot’. They were uploaded as ogv, and mediawiki could transcode them to other ogv, but consistently errored out when trying to convert them to webm. [21:59:30] The videoconvert on labs, also, would convert the AVI to a non-functional WebM… eventually found an online converter that would create a WebM that would convert back to OGV, and replaced the files on Commons with the ‘other format’ to get both through a transcode... [22:00:12] I suspect all that is a separate issue, though, with ‘external source’ files that are themselves somehow fucked up. [22:00:42] I am very much not a coding guy, tho.
[22:00:54] yo [22:01:03] o/ [22:01:44] Revent: the transcode system needs a lot of love :) [22:01:53] Revent: There's likely a lot of edge cases... Things not well tested. Weird and wonderful [22:02:29] Revent: please do file interesting cases you find in phab with specific links, i want to make sure the next generation of things is much improved [22:02:41] especially when it comes to things like bad format validation! [22:02:54] You can see deleted stuff, ye? [22:03:00] Sec... [22:03:39] https://commons.wikimedia.org/wiki/Commons:Deletion_requests/File:El-Ni%C3%B1o-and-coral-larval-dispersal-across-the-eastern-Pacific-marine-barrier-ncomms12571-s3.ogv [22:03:46] The deleted files there... [22:04:22] The ‘original source’ was AVIs, themselves rather clearly kind of broken in some obscure way. [22:04:31] hmm, i don't seem to have permissions there no [22:04:50] but i can get em later to track it down if that'll help :) [22:04:54] Umm… download the AVIs linked as sources... [22:05:13] well, no avi should be uploaded, that'd be a bug :) [22:05:19] Yea. [22:05:40] The thing was... [22:06:01] The uploaded ogvs would transcode to ‘other’ ogvs, but not to webm. [22:06:11] ahhhh, interesting [22:06:20] that probably points to a difference between ffmpeg and ffmpeg2theora [22:06:31] And the ‘videoconvert’ tool on labs, that I think uses the same library... [22:06:32] (ffmpeg2theora is used for the .ogv output) [22:07:03] Would produce a WebM that did not play. [22:07:51] I used several different online converters to try to create ‘different’ ogvs that would convert, with no luck [22:08:22] And until I finally got lucky, made WebMs… that would not play... [22:08:35] Revent: one thing i notice is the ogv is using 4:4:4 color resolution [22:08:44] this is relatively rare [22:09:01] and if i'm not mistaken, not supported by webm vp8 [22:09:05] that might be it...
[22:09:12] Even converting the AVI to MOV with quicktime, and THEN converting it to OGV produced a file that would not transcode to WebM. [22:09:26] Revent: i think you can force the output colorspace to 4:2:0, let me try [22:12:27] It was http://convert-video-online.com/ that, finally, produced a WebM that would convert back to OGV, which is how we resolved it. [22:12:51] Presumably they are using a different library. [22:13:02] so just straight converting with ffmpeg to webm seems ok on my local computer: ffmpeg -i ncomms6572-s4.avi -c:v libvpx -c:a libvorbis foo3.webm [22:13:35] or is that a working file... [22:13:56] lemme test the el nino one [22:14:08] upload the webm to testwiki and see if it transcodes. [22:14:27] (03CR) 10Dzahn: "fake words added to labs/private for beta https://gerrit.wikimedia.org/r/#/c/319924/" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [22:15:22] It has to, I think, be something about either bugs in the libraries or incompatibilities in the codecs that they choose…. [22:15:43] Revent: ok trying https://test.wikipedia.org/wiki/File:Transcode-testing-from-avi.webm [22:15:55] hah! [22:15:58] the webms all fail [22:16:01] yeah something's wrong there [22:16:19] is it specific to the file or is any upload failing? might be changes in production environment [22:16:27] It’s not ‘any [22:17:19] It’s, as far as I have seen, specific to files that are uploaded by the Open Access Media Importer bot… [22:17:38] Not that it is a flaw in the bot, but something about files from that source. [22:17:57] yeah lemme test this in my test environment where i can get more detail [22:18:02] Are all of its uploads borked? [22:18:08] If so, might be worth telling its owner to stop [22:18:15] https://test.wikipedia.org/wiki/File:Bunny.ogv <- all fine here with random test clip [22:18:22] Reedy: I have not tested ‘that many’...
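brion's suggestion above — force the output chroma subsampling down to 4:2:0, since VP8/WebM doesn't accept the file's unusual 4:4:4 sampling — would amount to adding `-pix_fmt yuv420p` to his ffmpeg invocation. A minimal sketch, with placeholder filenames and a small wrapper that only builds the command so it can be inspected before running (the wrapper itself is hypothetical, not anything from the channel):

```shell
# Build (don't run) an ffmpeg command that forces 4:2:0 chroma subsampling
# via -pix_fmt yuv420p, the workaround brion suggests for 4:4:4 ogv sources.
# Filenames are placeholders.
build_webm_cmd() {
  local src="$1" dst="$2"
  printf 'ffmpeg -i %s -c:v libvpx -pix_fmt yuv420p -c:a libvorbis %s\n' "$src" "$dst"
}

build_webm_cmd ncomms6572-s4.avi out.webm
# → ffmpeg -i ncomms6572-s4.avi -c:v libvpx -pix_fmt yuv420p -c:a libvorbis out.webm
```

Piping the printed command through `sh` (or just running it directly) would then attempt the actual conversion, assuming ffmpeg is built with libvpx and libvorbis support.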
[22:18:46] I’ve been mostly hitting uploads that seemed to fail due to timing out... [22:18:51] Revent: can you file a task in phabricator.wikimedia.org with urls to any you know are broken? [22:19:00] this is definitely something funky i need to investigate :) [22:19:34] They are, mainly at this point, the ones for the source of the files in that DR. [22:19:45] There was, also… hmmm [22:20:45] https://commons.wikimedia.org/wiki/File:Inclusion-flotation-driven-channel-segregation-in-solidifying-steels-ncomms6572-s4.ogv I think I mentioned. [22:21:13] ok i see: Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height\n [22:21:39] \n literal? :P [22:21:50] well in the string i have [22:21:57] which is wfDebug() output [22:22:12] actual \n in the source output :) [22:22:25] i'm wondering if it doesn't like odd widths on the file o_O [22:22:41] Frankly, I’ve just begun poking at the failed transcodes, and how few are shown in https://commons.wikimedia.org/wiki/Special:TimedMediaHandler and that most so far are just ‘bigass files that timed out’ has limited the source material. [22:23:19] It’s the [22:23:24] Er... [22:23:36] wait what's this? Failed to initialize encoder: Invalid parameter\n[libvpx @ 0xe7f780] Additional information: g_timebase.num out of range [1..cfg->g_timebase.den] [22:23:53] ok that's even weirder [22:24:22] It’s the basic claim on the page that something like a third of all transcodes have failed that really bothers me. [22:24:37] Revent: there's huge masses of old cruft in that [22:24:47] Understood. [22:24:47] including memory of old bugs where all the audio files got a failed transcode :) [22:25:58] It’s one of, many, tasks added to my backlog, but one that hopefully can be mostly run in the background.
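The `g_timebase.num out of range` error above comes from libvpx's encoder configuration, which ffmpeg derives from the stream's frame rate and time base. A hedged way to see what the broken AVIs are feeding it is to dump those fields with ffprobe; a very low or degenerate `avg_frame_rate` would be consistent with both this error and brion's later frame-rate hunch. `broken.avi` is a placeholder filename.

```shell
# Hedged diagnostic for the libvpx timebase error: dump the frame rates
# and time base of the source's video stream. These values feed libvpx's
# g_timebase configuration. "broken.avi" is a placeholder filename.
probe="ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate,time_base -of default=noprint_wrappers=1 broken.avi"
echo "$probe"
```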
[22:27:09] If the whole list was visible… not just the latest 50, my understanding is that only admins can kick a transcode… [22:28:27] ok the same run that failed in my test environment worked in my local environment. [22:28:28] sighhhhh [22:28:30] I wonder if it would be worth running an admin bot, coded by someone smarter than me, to try kicking all failed transcodes exactly once. [22:29:24] With respect to the fact that it’s clearly possible to make the video scalers implode, lol. [22:29:30] that's the danger ;) [22:29:34] currently very few machines in that pool [22:29:48] brion: I dunno if you know that I did exactly that earlier. [22:29:48] but yes, needs a cleanup pass... [22:30:07] It may be possible to pool the codfw boxes too [22:30:32] ok i've got several issues to track down here [22:30:34] Misunderstanding how the queue worked, 270% load for 6 hours…. [22:30:50] * bad error reporting (it's seen in logs but not exposed to admin) [22:31:07] * specific fail in the ffmpeg transcodes that doesn't affect ffmpeg2theora transcodes [22:31:15] * general bad reporting ;) [22:31:51] Revent: yeah it'd work better with a good exposure of the queue length to manage when to add more transcodes [22:32:16] * and possibly a version-specific ffmpeg issue [22:32:21] YES, if we could see what the error was instead of just kicking the file back in that would be outstanding. [22:32:31] yeah it's supposed to show it! [22:32:35] it's just .... not for some reason here [22:33:40] ok i can definitely see the same command line fail within mediawiki-vagrant and work on mac os [22:34:00] With the caveat… that most Commons admins never notice and probably do not realize that kicking a re-transcode is an admin thing.
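The "admin bot kicking each failed transcode exactly once" idea could in principle go through TimedMediaHandler's API. The sketch below assumes a `transcodereset` API action exists and takes a page title and CSRF token; the action name, parameters, and token handling here are assumptions, not anything confirmed in the log, and the title and token are placeholders.

```shell
# Hypothetical sketch of re-kicking one failed transcode via the API.
# The transcodereset action and its parameters are assumptions; in a real
# bot you would fetch a CSRF token first and iterate over the failed list.
title="File:Example.ogv"
token="CSRF_TOKEN_PLACEHOLDER"
kick="curl -s -X POST https://commons.wikimedia.org/w/api.php --data-urlencode action=transcodereset --data-urlencode title=$title --data-urlencode token=$token --data-urlencode format=json"
echo "$kick"
```

As the conversation notes, doing this naively across 350k files would flood the small videoscaler pool, so any such bot would need to throttle against visible queue length.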
[22:34:06] the mac has ffmpeg 3.1.5, much newer than the 2.7.2 on vagrant [22:34:42] humans should never have to manually re-kick a transcode, it indicates a low-level failure [22:34:56] but of course it's a house of cards and failure happens :D [22:35:54] brion: I did not install it locally and try to transcode, my ‘macness’ is very much a transient thing based upon access to a 21” display, lol. [22:36:20] :D [22:36:44] Revent: try forcing the webm or ogv to have a specific frame rate like 15 or 30 [22:36:58] i think it's being confused about the super-low frame rate actually [22:37:28] I can try that when I hit another one... [22:39:02] Revent: yeah https://test.wikipedia.org/wiki/File:Transcode-testing-from-avi-30fps.webm looks happier so far [22:39:10] i added "-r 30" on the end to force frame rate to 30fps [22:39:45] i'll see if we can either fix or work around on our end though, that should not fail with low rates [22:39:56] Frankly, the very low limit in TimedMediaHandler on how many problematic files are visible (50, if I counted right) is a major issue, given that Commons claims to have 350k problematic files. [22:40:12] agreed! [22:40:29] the entire management ui needs a rewrite [22:41:08] I cannot meaningfully ‘select’ among them for what seem to be likely to expose a certain issue.. [22:41:26] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2773164 (10Tgr) So I guess there are three options: * Try to integrate Kafka as a jobqueue backend (ie. write a `JobQueue... [22:42:44] I’m buried in Obama ranting at 1080p for an hour, lol. [22:43:21] that'll take a while to transcode :D [22:43:37] Revent: ok here's for the transcode err: https://phabricator.wikimedia.org/T150066 [22:43:44] brion: Give me those idle servers on codfw!
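brion's working fix above was appending `-r 30` to the encode so libvpx gets a sane output frame rate regardless of the source's super-low rate. A minimal sketch of that command, with placeholder filenames (the log only states that `-r 30` was added "on the end"):

```shell
# brion's workaround from the log: force the output frame rate to 30fps
# with -r 30 so libvpx receives a valid timebase. Filenames are placeholders.
fix="ffmpeg -i source.avi -c:v libvpx -c:a libvorbis -r 30 out-30fps.webm"
echo "$fix"
```

Note that `-r 30` placed after the input (as an output option) duplicates or drops frames to hit 30fps; placed before `-i` it would instead reinterpret the input rate.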
[22:43:54] !log stop puppet on einsteinium and tegmen to avoid log spam - T150061 [22:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:01] T150061: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061 [22:44:34] Reedy: if you can make those idle servers happen i'd be happy to help revent use them up ;) [22:44:58] brion: Would need some opsen at least... [22:45:03] yeah [22:47:02] Unfortunately, I think Obama et al ranting would consume them for a bit, based on what shows up in the broken transcode page. [22:48:06] yes there's also an ongoing bug that long transcodes don't reserve their cpu time properly because the job queue thinks they're done early [22:48:19] which means other transcodes start running, slowing down the first one [22:48:45] so basically the building's on fire, but it's ok ;) [22:48:53] 06Operations: Use codfw videoscalers - https://phabricator.wikimedia.org/T150067#2773188 (10Reedy) [22:49:24] brion: Building on fire? (lol) [22:49:37] I bet just using the codfw ones would be quicker [22:49:39] 06Operations: Use codfw videoscalers - https://phabricator.wikimedia.org/T150067#2773188 (10brion) I'm going to run major batch re-runs and have been waiting on there being more capacity, as well as fixing various bugs. More capacity would be great. [22:49:46] https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [22:50:00] That was me, accidentally. [22:50:11] heh [22:50:33] i gotta run -- make sure everything's in phab and cc me if necessary :) [22:50:56] and once i'm done with fixing subtitles i'll get back to improving the management interface...
[22:51:13] 06Operations, 10Wikimedia-General-or-Unknown: Use codfw videoscalers - https://phabricator.wikimedia.org/T150067#2773214 (10Reedy) [22:59:19] 06Operations, 10Wikimedia-General-or-Unknown: Use codfw videoscalers - https://phabricator.wikimedia.org/T150067#2773188 (10Revent) A 'lot'... does a third of a damn million qualify? It appears to be well over twice the existing capacity, just sitting there burning electricity to no actual purpose. [23:03:10] Sorry for the rudeness, but… really, have those machines really been sitting idle while eating electricity for a year now? [23:05:37] mostly yes, i think. codfw is set up to pick up everything from eqiad within a few minutes' notice, for availability in case of some catastrophic failures somewhere. [23:09:02] A backup capacity makes sense, but… letting it sit completely idle when the primary servers are maxed out makes no sense [23:11:35] Revent: so your suggestion is just to let both clusters get used up so we have no redundancy in case of emergency, compared to giving brion some time to look at fixing the issues with the jobrunner causing them to get overloaded? [23:12:46] Freely admitting…. I personally killed the video scalers, but it appears that over twice the capacity was sitting idle. [23:14:38] there is multi-dc work being done (to enable us to use both dcs actively). It's not a simple problem ("just add these to a list somewhere"). [23:17:25] p858snake: It’s rather hard to imagine a video scaling emergency of any sort, much less one that would suddenly require over twice the existing capacity.
[23:19:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [23:20:06] Revent: two emergencies would be eqiad losing network connectivity or power and requiring a quick swap over to codfw [23:20:07] Revent: well to be honest, the current situation is not really a big emergency either, and it is friday afternoon after all. [23:20:39] i don't think ops people like making any changes just before the weekend ;) [23:21:15] i'm not saying we shouldn't utilize some of our hot spares, but we need to ensure any temp fixes don't become permanent and affect our emergency failover procedures [23:21:33] (and like greg says, being able to use both simultaneously is probably a lot harder than being able to use one or the other) [23:25:38] I am not claiming it’s an emergency… that would be silly… but you must admit, an emergency capacity that is over twice the normal capacity is equally silly. [23:39:16] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:39:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [23:41:22] Revent: hi, why do you think that's silly? [23:42:00] Servers that were bought a few years apart potentially [23:42:06] Of course they're gonna differ [23:42:16] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:43:39] codfw has had more powerful appservers for a long time. I believe it will never be truly the same [23:44:26] Eventually, the EQIAD ones will be replaced [23:44:30] And they'll be more powerful [23:44:49] Unless both were kitted out at the same time, it'll never be equal...
And it probably doesn't need to be [23:45:02] Yeah, you'll have to wait until the current out of warranty R420s are replaced with new ones [23:45:53] A third video scaler shouldn't hurt. But the video scaling cluster was not in direct need of better servers I guess [23:46:16] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:46:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:47:04] that will come soon likely. The operations team will know when it makes sense to get replacements for the current video scalers, or adjusting the current process (i.e. multi-DC support) [23:47:16] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:49:11] getting dnserror for hewiki [23:49:43] as in page can't be displayed [23:49:57] That doesn't sound like a dns error [23:50:14] Is that microsoft edge? [23:50:29] IE [23:50:53] Was previewing a page edit, and it went out [23:51:05] hm, strange [23:51:07] Works in IE for me [23:51:20] can you reproduce it? [23:51:26] Nope [23:51:31] Going to hewiki mainpage in a new tab also says can't be displayed [23:52:12] Pinging he.wikipedia.org [2620:0:862:ed1a::1] with 32 bytes of data: seems to work tho [23:52:25] is this happening in any other browser? [23:55:05] Now it works, but i lost my edits as it just loaded the editor with the full page in. Uh [23:59:28] The point is it was unavailable at all for me for some mins, and it wasn't a connectivity problem because other stuff continued to work (e.g. a file I was downloading didn't get interrupted)