[00:53:55] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3731397 (10mmodell) >>! In T179353#3730246, @mobrovac wrote: > More to the point of the problem, perhaps a viable alternative here would be for Scap to detect the... [01:04:14] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1509671051 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3795702 keys, up 4 minutes 9 seconds - replication_delay is 1509671051 [01:04:23] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:04:44] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:05:14] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8494692 keys, up 5 minutes 10 seconds - replication_delay is 0 [01:05:23] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3787354 keys, up 5 minutes 13 seconds - replication_delay is 0 [01:05:34] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509671132 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3797802 keys, up 5 minutes 30 seconds - replication_delay is 1509671132 [01:05:34] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509671132 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3795451 keys, up 5 minutes 29 seconds - replication_delay is 1509671132 [01:05:43] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509671138 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 
databases (db0) with 3799800 keys, up 5 minutes 35 seconds - replication_delay is 1509671138 [01:05:44] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8392043 keys, up 5 minutes 39 seconds - replication_delay is 0 [01:07:43] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3791716 keys, up 7 minutes 37 seconds - replication_delay is 0 [01:08:43] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3789547 keys, up 8 minutes 38 seconds - replication_delay is 0 [01:08:44] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3787765 keys, up 8 minutes 37 seconds - replication_delay is 0 [01:35:03] PROBLEM - Check whether ferm is active by checking the default input chain on scb1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [01:35:53] RECOVERY - Check whether ferm is active by checking the default input chain on scb1002 is OK: OK ferm input default policy is set [01:39:23] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Service[systemd-timesyncd] [01:42:33] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured [01:42:33] il 29, 2016) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a [01:42:33] ved: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received [01:45:33] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/list/tool/{tool} (Get the tools for all language pairs) timed out before a response was received: / (root with wrong query param) timed out before a response was received: /_info/version (retrieve service version) timed out before a response was received [01:46:23] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [01:48:14] PROBLEM - Host mw1191 is DOWN: PING CRITICAL - Packet loss = 100% [01:49:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [02:04:23] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [03:26:14] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 880.40 seconds [03:26:32] what's up with all the redis, 
mobileapps, scb, cxserver, mw1191, etc noise above? [03:56:41] (03PS1) 10BBlack: new globalsign unified certs [puppet] - 10https://gerrit.wikimedia.org/r/388378 [04:05:07] (03CR) 10BBlack: [C: 032] new globalsign unified certs [puppet] - 10https://gerrit.wikimedia.org/r/388378 (owner: 10BBlack) [04:06:33] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 143.85 seconds [06:25:01] !log Stop MySQL on db1103 to copy it to dbstore1001 and reimage it - T178359 [06:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:10] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:33:00] (03PS1) 10Marostegui: install_server: Reimage db1103 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/388389 (https://phabricator.wikimedia.org/T178359) [06:35:03] (03PS1) 10Marostegui: s3,s5.hosts: Add db2085 to s3 and s5 [software] - 10https://gerrit.wikimedia.org/r/388390 (https://phabricator.wikimedia.org/T178359) [06:35:21] (03CR) 10Marostegui: [C: 032] install_server: Reimage db1103 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/388389 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:37:13] (03CR) 10Marostegui: [C: 032] s3,s5.hosts: Add db2085 to s3 and s5 [software] - 10https://gerrit.wikimedia.org/r/388390 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:38:19] (03Merged) 10jenkins-bot: s3,s5.hosts: Add db2085 to s3 and s5 [software] - 10https://gerrit.wikimedia.org/r/388390 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:39:22] !log Stop MySQL on db2089.s5 to copy it to db2085 - T178359 [06:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:28] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [07:29:31] !log depooled mw1191 and powercycle - host down with 'CPU 1 check errors' in racadm getsel [07:29:37] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:13] RECOVERY - Host mw1191 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [07:35:41] (03PS1) 10Marostegui: mariadb: Convert db1103 into multinstance: s2 s4 [puppet] - 10https://gerrit.wikimedia.org/r/388394 (https://phabricator.wikimedia.org/T178359) [07:40:12] 10Operations, 10ops-eqiad: mw1191 ipmi-sel cpu errors - https://phabricator.wikimedia.org/T179640#3731680 (10elukey) [07:45:34] (03CR) 10Marostegui: "Looks good: https://puppet-compiler.wmflabs.org/compiler02/8620/" [puppet] - 10https://gerrit.wikimedia.org/r/388394 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:57:00] !log Deploy alter table on s4.db2037 - T174569 [07:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:07] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [07:59:23] 10Operations, 10monitoring: Cluster puppet variable and ganglia decommission - https://phabricator.wikimedia.org/T179395#3731703 (10Joe) All you write would be great, if only > Given that we're going to have a single Puppet role per host would be true in the short-to-midterm. I don't see that happening. So... [08:00:54] (03CR) 10DCausse: [C: 031] elasticsearch: auto reload log4j2 configuration [puppet] - 10https://gerrit.wikimedia.org/r/388130 (owner: 10Gehel) [08:01:22] 10Operations, 10monitoring: Cluster puppet variable and ganglia decommission - https://phabricator.wikimedia.org/T179395#3731705 (10Joe) Also, don't forget our code needs to work within labs. [08:05:35] 10Operations, 10Ops-Access-Requests: Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3720398 (10MoritzMuehlenhoff) Can you elaborate what you need in specific to debug wikidata performance problems? We can arrange access to all the logs you need, but perf-roots grants full root access to near... 
[08:12:04] !log installing openjdk-8 security updates on stretch [08:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:26] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3731726 (10elukey) Status: ``` elukey@terbium:~$ mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php |sort -n -k2 | tail -n 20 euwiki 237... [08:16:37] !log drop CommandInvocation_15243810 and CommandInvocation_15243810_15423246 from analytics dbs (db1046/db1047/db1108/dbstore1002) - data archived on HDFS - T166712 [08:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:43] T166712: Remove logging from labs for schema https://meta.wikimedia.org/wiki/Schema:CommandInvocation - https://phabricator.wikimedia.org/T166712 [08:16:59] (03CR) 10Marostegui: [C: 032] mariadb: Convert db1103 into multinstance: s2 s4 [puppet] - 10https://gerrit.wikimedia.org/r/388394 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:18:57] PROBLEM - mediawiki-installation DSH group on mw1191 is CRITICAL: Host mw1191 is not in mediawiki-installation dsh group [08:20:07] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3731740 (10mobrovac) >>! In T173710#3730359, @elukey wrote: > https://gerrit.wikimedia.org/r/#/c/385248 should be already working for commons, but from... [08:26:12] (03CR) 10Zoranzoki21: [C: 031] "> This should be OK to go out, use action=edit to test whether it's" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [08:36:11] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [08:36:43] really? 
[08:39:13] (03CR) 10Ppchelko: [C: 031] JobQueue: Use EventBus for all "hearted" jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388139 (https://phabricator.wikimedia.org/T175210) (owner: 10Mobrovac) [08:46:12] !log pool cp4021 [08:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:47] !log mobrovac@tin Started deploy [restbase/deploy@a67b4a7]: Use only the new storage for summaries - T179418 [08:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:53] T179418: Migrate page summary from legacy to new storage - https://phabricator.wikimedia.org/T179418 [09:06:07] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [09:07:09] (03PS2) 10Volans: Documentation: refactor documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/388261 [09:07:44] (03PS2) 10Ema: VCL: log TLS information to VSM [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) [09:09:26] !log mobrovac@tin Finished deploy [restbase/deploy@a67b4a7]: Use only the new storage for summaries - T179418 (duration: 10m 40s) [09:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:33] T179418: Migrate page summary from legacy to new storage - https://phabricator.wikimedia.org/T179418 [09:10:16] (03CR) 10Volans: [C: 032] "Documentation-only, self-merging." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/388261 (owner: 10Volans) [09:12:50] (03Merged) 10jenkins-bot: Documentation: refactor documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/388261 (owner: 10Volans) [09:17:15] (03PS3) 10Ema: VCL: log TLS information to VSM [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) [09:28:47] (03CR) 10Giuseppe Lavagetto: [C: 031] prometheus: add redis_exporter class and profile [puppet] - 10https://gerrit.wikimedia.org/r/325466 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [09:39:21] (03PS3) 10Ema: VCL: add layer information to X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) [09:47:26] (03CR) 10Ema: [C: 032] VCL: add layer information to X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [09:50:46] (03PS5) 10Volans: Backends: add support to external backends plugins [software/cumin] - 10https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) [09:50:48] (03PS3) 10Volans: Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) [09:50:50] (03PS3) 10Volans: Logging: use % syntax for parameters [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) [09:52:32] (03CR) 10Volans: "Ready for review and rebased with master" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) (owner: 10Volans) [10:12:46] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [10:13:06] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out 
before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on [10:13:06] ut before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received [10:13:39] (03PS1) 10Muehlenhoff: Update to 1.0.2m [debs/openssl] - 10https://gerrit.wikimedia.org/r/388413 [10:13:46] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.006 second response time [10:13:57] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:18:24] (03PS1) 10Giuseppe Lavagetto: Increase concurrency for htmlCacheUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388416 (https://phabricator.wikimedia.org/T173710) [10:20:00] <_joe_> elukey: ^^ [10:21:07] (03CR) 10Elukey: [C: 031] Increase concurrency for htmlCacheUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388416 (https://phabricator.wikimedia.org/T173710) (owner: 10Giuseppe Lavagetto) [10:28:04] <_joe_> elukey: do you think this is worth releasing today? 
[10:28:23] <_joe_> I'd like to have the rest of the day to evaluate if this had any positive effect tbh [10:28:23] (03PS1) 10Marostegui: db-eqiad.php: Add a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388417 [10:28:53] (03PS2) 10Marostegui: db-eqiad.php: Add a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388417 [10:30:13] _joe_ I think that we'd need to take actions since the queue is increasing a lot every day, I'd like to test it today too [10:31:22] (03CR) 10Muehlenhoff: [C: 032] Update to 1.0.2m [debs/openssl] - 10https://gerrit.wikimedia.org/r/388413 (owner: 10Muehlenhoff) [10:31:58] <_joe_> ok so, let's go [10:32:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Increase concurrency for htmlCacheUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388416 (https://phabricator.wikimedia.org/T173710) (owner: 10Giuseppe Lavagetto) [10:32:31] (03CR) 10jenkins-bot: Increase concurrency for htmlCacheUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388416 (https://phabricator.wikimedia.org/T173710) (owner: 10Giuseppe Lavagetto) [10:36:33] (03CR) 10Dzahn: "localhost should work like before, see the "Listen 127.0.0.1" lines that i don't touch in this change" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [10:36:59] (03PS21) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [10:39:07] !log oblivian@tin Synchronized wmf-config/CommonSettings.php: Increase concurrency of htmlCacheUpdate jobs T173710 (duration: 00m 48s) [10:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:15] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710 [10:40:54] (03CR) 10Dzahn: [C: 032] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [10:41:22] waits before touching cobalt , touches only gerrit2001 first [10:41:23] (03CR) 
10Marostegui: [C: 032] db-eqiad.php: Add a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388417 (owner: 10Marostegui) [10:42:59] !log cobalt: (gerrit) temp disable puppet - gerrit2001: restart apache, apply gerrit:354078 stop listening on all interfaces [10:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:13] (03Merged) 10jenkins-bot: db-eqiad.php: Add a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388417 (owner: 10Marostegui) [10:43:28] (03CR) 10jenkins-bot: db-eqiad.php: Add a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388417 (owner: 10Marostegui) [10:44:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comment to db1103 status (duration: 00m 47s) [10:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:24] (03CR) 10Dzahn: "gerrit2001:" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [10:46:35] (03PS1) 10Marostegui: s4.hosts: db1103 s4 instance listen on 3314 [software] - 10https://gerrit.wikimedia.org/r/388418 (https://phabricator.wikimedia.org/T178359) [10:46:36] PROBLEM - HTTPS on gerrit2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:46:44] !log Compress InnoDB on db1103.s4 - T178359 [10:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:50] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [10:46:52] hah icinga-wm - using the wrong IP! [10:47:13] that just showed we need to fix the check . https://gerrit-slave.wikimedia.org is fine [10:48:01] (03CR) 10Dzahn: "https://gerrit-slave.wikimedia.org <-- works" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [10:50:02] (03CR) 10Dzahn: "how come it actually works right now before this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [10:52:42] (03CR) 10Paladox: "> how come it actually works right now before this?" [puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [10:53:04] (03PS4) 10Paladox: Gerrit: Fix nagios check [puppet] - 10https://gerrit.wikimedia.org/r/388189 [10:53:43] paladox: ok :) thanks [10:53:50] paladox: also see the icinga-wm above, heh [10:53:57] the Apache part works though [10:54:13] looks like we have the wrong $tls_host set, looking [10:54:13] (03CR) 10Paladox: "I had the same problem on labs where the check worked, but as soon as I restarted, the check failed. This is due to chads heap change he d" [puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [10:58:25] (03CR) 10Addshore: [C: 04-1] Add loading of wikibase extensions from build (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [11:10:42] (03CR) 10Marostegui: [C: 032] s4.hosts: db1103 s4 instance listen on 3314 [software] - 10https://gerrit.wikimedia.org/r/388418 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:11:23] (03Merged) 10jenkins-bot: s4.hosts: db1103 s4 instance listen on 3314 [software] - 10https://gerrit.wikimedia.org/r/388418 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:11:32] (03PS1) 10Dzahn: gerrit: fix HTTPS monitoring, use service IP [puppet] - 10https://gerrit.wikimedia.org/r/388424 [11:12:42] (03PS2) 10Dzahn: gerrit: fix HTTPS monitoring, use service IP [puppet] - 10https://gerrit.wikimedia.org/r/388424 [11:12:49] !log uploaded openssl 1.0.2m-1 for to apt.wikimedia.org [11:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:05] (03CR) 10Dzahn: [C: 032] gerrit: fix HTTPS monitoring, use service IP [puppet] - 10https://gerrit.wikimedia.org/r/388424 (owner: 10Dzahn) [11:13:32] (03PS1) 10Marostegui: db-eqiad.php: Add 
db2084:3314 and db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388425 (https://phabricator.wikimedia.org/T178359) [11:14:16] (03CR) 10Marostegui: [C: 04-1] "Do not push till Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388425 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:15:29] (03CR) 10Dzahn: [C: 032] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/388424 (owner: 10Dzahn) [11:15:55] (03CR) 10jerkins-bot: [V: 04-1] gerrit: fix HTTPS monitoring, use service IP [puppet] - 10https://gerrit.wikimedia.org/r/388424 (owner: 10Dzahn) [11:17:00] (03CR) 10Dzahn: [C: 032] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/388424 (owner: 10Dzahn) [11:17:00] ABORTED in 3m 00s [11:18:34] (03PS1) 10Gehel: udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) [11:18:54] due to zuul restart apparently. works again [11:20:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall LGTM, but I have a couple doubts. 
Also, reviewing the code made me wonder if we wouldn't be better off spinning the data-model com" (034 comments) [software/service-checker] - 10https://gerrit.wikimedia.org/r/386116 (https://phabricator.wikimedia.org/T150560) (owner: 10Mobrovac) [11:24:09] (03PS2) 10Gehel: udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) [11:28:55] (03PS1) 10Muehlenhoff: Fix code commenting out installer apt lines with new repository layout [puppet] - 10https://gerrit.wikimedia.org/r/388427 [11:29:48] (03CR) 10jerkins-bot: [V: 04-1] Fix code commenting out installer apt lines with new repository layout [puppet] - 10https://gerrit.wikimedia.org/r/388427 (owner: 10Muehlenhoff) [11:31:09] (03PS2) 10Muehlenhoff: Fix code commenting out installer apt lines with new repository layout [puppet] - 10https://gerrit.wikimedia.org/r/388427 [11:33:22] (03PS5) 10Dzahn: Gerrit: Fix nagios check [puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [11:33:52] RECOVERY - HTTPS on gerrit2001 is OK: SSL OK - Certificate gerrit-slave.wikimedia.org valid until 2018-01-15 14:57:28 +0000 (expires in 73 days) [11:33:54] (03PS6) 10Dzahn: Gerrit: Fix Icinga gerrit process check [puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [11:34:25] (03CR) 10Dzahn: "07:33 <+icinga-wm> RECOVERY - HTTPS on gerrit2001 is OK: SSL OK - Certificate gerrit-slave.wikimedia.org valid until 2018-01-15 14:57:28 +" [puppet] - 10https://gerrit.wikimedia.org/r/388424 (owner: 10Dzahn) [11:34:27] (03PS3) 10Gehel: udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) [11:34:59] (03CR) 10jenkins-bot: Documentation: refactor documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/388261 (owner: 10Volans) [11:35:12] (03CR) 10Dzahn: [C: 032] Gerrit: Fix Icinga gerrit process check [puppet] - 10https://gerrit.wikimedia.org/r/388189 
(owner: 10Paladox) [11:41:26] !log gerrit - restart Apache - apply gerrit:354078 - only listen on service IP [11:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:02] (03CR) 10Dzahn: "also applied on cobalt. https://cobalt.wikimedia.org/ is gone now" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [11:45:08] !log restarting jenkins on releases* to pick up openjdk security update [11:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:48] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3732744 (10MoritzMuehlenhoff) Building a git 2.11 for trusty is probably just a matter of 1-2 hours work, but it's something we would need to repeat for every git... [12:13:25] paladox: hi, so does it really change from "username =" to " user =" in itsphabricator bot? [12:27:19] (03PS10) 10Dzahn: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) (owner: 10Paladox) [12:30:16] (03CR) 10Dzahn: [C: 032] Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) (owner: 10Paladox) [12:32:58] (03Draft1) 10Paladox: Gerrit: Re add certificate for its-phabricator temporary [labs/private] - 10https://gerrit.wikimedia.org/r/388430 [12:33:01] (03PS2) 10Paladox: Gerrit: Re add certificate for its-phabricator temporary [labs/private] - 10https://gerrit.wikimedia.org/r/388430 [12:34:42] (03CR) 10Dzahn: [V: 032 C: 032] Gerrit: Re add certificate for its-phabricator temporary [labs/private] - 10https://gerrit.wikimedia.org/r/388430 (owner: 10Paladox) [12:35:57] (03PS2) 10Marostegui: wikireplicas: Add index for page_props.pp_value [puppet] - 10https://gerrit.wikimedia.org/r/386973 
(https://phabricator.wikimedia.org/T140609) (owner: 10BryanDavis) [12:37:11] (03CR) 10Marostegui: [C: 032] wikireplicas: Add index for page_props.pp_value [puppet] - 10https://gerrit.wikimedia.org/r/386973 (https://phabricator.wikimedia.org/T140609) (owner: 10BryanDavis) [12:39:12] (03PS1) 10Marostegui: Revert "wikireplicas: Add index for page_props.pp_value" [puppet] - 10https://gerrit.wikimedia.org/r/388431 [12:39:55] (03CR) 10Marostegui: [C: 032] Revert "wikireplicas: Add index for page_props.pp_value" [puppet] - 10https://gerrit.wikimedia.org/r/388431 (owner: 10Marostegui) [12:41:01] PROBLEM - puppet last run on mc2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:41:56] !log uploaded openjdk 8u151-b12 to apt.wikimedia.org/jessie-wikimedia [12:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:05] (03PS13) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [12:48:25] (03PS1) 10BBlack: certs: add globalsign-2017 to deployed set [puppet] - 10https://gerrit.wikimedia.org/r/388432 [12:49:21] (03PS14) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [12:49:33] (03CR) 10BBlack: [C: 032] certs: add globalsign-2017 to deployed set [puppet] - 10https://gerrit.wikimedia.org/r/388432 (owner: 10BBlack) [12:57:16] (03CR) 10Dzahn: [C: 04-2] "yea, so the other related changes are all good but this one isn't actually behind varnish per "monitoring tools should not be cached" rule" [puppet] - 10https://gerrit.wikimedia.org/r/366519 (owner: 10Muehlenhoff) [13:03:52] (03CR) 10Dzahn: [C: 031] "the same change works and is now used on: transparency.wm.org, annual.wm.org, releases.wm.org, static-bugzilla, RT, .. 
good to go, adding " [puppet] - 10https://gerrit.wikimedia.org/r/366858 (owner: 10Muehlenhoff) [13:05:19] (03CR) 10Dzahn: [C: 031] "the same change works and is now used on: transparency.wm.org, annual.wm.org, releases.wm.org, static-bugzilla, RT." [puppet] - 10https://gerrit.wikimedia.org/r/366811 (owner: 10Muehlenhoff) [13:05:34] (03PS1) 10BBlack: unified certs: shift year to hieradata for easier renewals [puppet] - 10https://gerrit.wikimedia.org/r/388438 [13:05:36] (03PS1) 10BBlack: cp1008: switch unified to GS-2017 for testing [puppet] - 10https://gerrit.wikimedia.org/r/388439 [13:06:01] RECOVERY - puppet last run on mc2029 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:06:26] ACKNOWLEDGEMENT - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Elukey T178742 [13:11:16] (03PS1) 10BBlack: add dummy GS-2017 keys [labs/private] - 10https://gerrit.wikimedia.org/r/388440 [13:11:32] (03CR) 10BBlack: [V: 032 C: 032] add dummy GS-2017 keys [labs/private] - 10https://gerrit.wikimedia.org/r/388440 (owner: 10BBlack) [13:11:55] (03CR) 10Dzahn: "@chasemp Well, i didn't make up the regex, i just found it and using it for more things. So it has to be maintained already. Or would you " [puppet] - 10https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [13:12:54] (03CR) 10Dzahn: "do you see any issue with this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/384893 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [13:17:12] (03CR) 10BBlack: [C: 032] unified certs: shift year to hieradata for easier renewals [puppet] - 10https://gerrit.wikimedia.org/r/388438 (owner: 10BBlack) [13:24:50] (03PS2) 10BBlack: cp1008: switch unified to GS-2017 for testing [puppet] - 10https://gerrit.wikimedia.org/r/388439 [13:25:01] RECOVERY - Long running screen/tmux on mwlog1001 is OK: OK: No SCREEN or tmux processes detected. [13:26:28] (03PS2) 10Dzahn: Drop unused role::ci::jenkins_access [puppet] - 10https://gerrit.wikimedia.org/r/386811 (owner: 10Hashar) [13:28:22] (03CR) 10Dzahn: [C: 032] Drop unused role::ci::jenkins_access [puppet] - 10https://gerrit.wikimedia.org/r/386811 (owner: 10Hashar) [13:30:16] (03CR) 10BBlack: [C: 032] cp1008: switch unified to GS-2017 for testing [puppet] - 10https://gerrit.wikimedia.org/r/388439 (owner: 10BBlack) [13:30:20] (03PS3) 10BBlack: cp1008: switch unified to GS-2017 for testing [puppet] - 10https://gerrit.wikimedia.org/r/388439 [13:31:21] (03PS2) 10Dzahn: graphite: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385478 (owner: 10Hashar) [13:32:02] (03CR) 10Dzahn: [C: 032] graphite: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385478 (owner: 10Hashar) [13:33:47] !log restart electron render service [13:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:26] (03PS4) 10BBlack: cp1008: switch unified to GS-2017 for testing [puppet] - 10https://gerrit.wikimedia.org/r/388439 [13:34:33] (03CR) 10BBlack: [V: 032 C: 032] cp1008: switch unified to GS-2017 for testing [puppet] - 10https://gerrit.wikimedia.org/r/388439 (owner: 10BBlack) [13:35:28] !log ppchelko@tin Started restart [electron-render/deploy@8dd5f13]: Restarting, service is stuck [13:35:28] (03PS3) 10Dzahn: Migrate contint::firewall to a profile [puppet] - 
10https://gerrit.wikimedia.org/r/385472 (owner: 10Hashar) [13:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:27] (03Abandoned) 10Muehlenhoff: Restrict HTTP access in role::librenms [puppet] - 10https://gerrit.wikimedia.org/r/366519 (owner: 10Muehlenhoff) [13:42:40] (03CR) 10Dzahn: "let's use CACHE_MISC here now, right?" [puppet] - 10https://gerrit.wikimedia.org/r/366521 (owner: 10Muehlenhoff) [13:42:56] mutante: thx :) [13:44:07] (03CR) 10Hashar: [C: 04-1] "Untested and probably need a rebase." [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) (owner: 10Hashar) [13:50:29] (03CR) 10Dzahn: [C: 04-1] Restrict HTTP access for racktables [puppet] - 10https://gerrit.wikimedia.org/r/366521 (owner: 10Muehlenhoff) [13:50:34] !log installing heimdal security updates on trusty [13:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:02] (03CR) 10Dzahn: [C: 032] Migrate contint::firewall to a profile [puppet] - 10https://gerrit.wikimedia.org/r/385472 (owner: 10Hashar) [13:51:15] hashar: :) [13:53:18] (03CR) 10Dzahn: "no-op on contint1001/2001 :)" [puppet] - 10https://gerrit.wikimedia.org/r/385472 (owner: 10Hashar) [13:53:51] (03PS4) 10Dzahn: contint: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385476 (owner: 10Hashar) [13:55:07] (03CR) 10Dzahn: [C: 032] contint: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385476 (owner: 10Hashar) [14:12:41] !log Upgrading packages on contint1001 / contint2001 [14:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:38] (03PS2) 10Dzahn: memcached: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382922 (https://phabricator.wikimedia.org/T177225) [14:16:22] !log restarting Jenkins on contint1001 [14:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:19:41] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [14:19:45] !log installing java sec updates / cassandra restarts on xenon/praseodymium (cerium already done earlier) [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:41] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888 [14:28:00] (03Draft1) 10Paladox: Gerrit: fix wrong user syntax for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/388446 [14:28:02] (03PS2) 10Paladox: Gerrit: fix wrong user syntax for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/388446 [14:28:50] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard): Update docker image docker-registry.wikimedia.org/wikimedia-jessie - https://phabricator.wikimedia.org/T177055#3733180 (10hashar) [14:31:29] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard): Update docker image docker-registry.wikimedia.org/wikimedia-jessie - https://phabricator.wikimedia.org/T177055#3733181 (10hashar) It has been upgraded: ``` docker-registry.wikimedia.org/wikimedia-jessie latest a81cc7ec7... [14:31:36] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard): Update docker image docker-registry.wikimedia.org/wikimedia-jessie - https://phabricator.wikimedia.org/T177055#3733182 (10hashar) 05Open>03Resolved [14:31:43] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-EventLogging: Requesting Sharvani Haran to be added to researchers group - https://phabricator.wikimedia.org/T179611#3730724 (10herron) Hello, membership to group `researchers` would provide access to `stat1006` but not `stat1005`. Could you ple... [14:35:27] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? 
- https://phabricator.wikimedia.org/T178570#3733188 (10phuedx) >>! In T178570#3698689, @Joe wrote: > - How do we download chromium in the first place in a verifiable way?... [14:37:22] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review: wikimedia-jessie & wikimedia-stretch docker images don't have deb-src set for apt.wikimedia.org - https://phabricator.wikimedia.org/T179354#3733193 (10akosiaris) p:05Triage>03Normal @Legoktm, I am re-reading the task an... [14:38:17] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-EventLogging: Requesting Sharvani Haran to be added to researchers group - https://phabricator.wikimedia.org/T179611#3730724 (10Ottomata) deployment-eventlog02 access is handled by cloud services and/or release engineering folks, one of them can... [14:38:45] (03CR) 10Paladox: "This follows up to https://gerrit.wikimedia.org/r/#/c/384901/ which accidentally changed username to user. (Didn’t realise at the time unt" [puppet] - 10https://gerrit.wikimedia.org/r/388446 (owner: 10Paladox) [14:39:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] oresweb: Switch to $CACHE_MISC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/366811 (owner: 10Muehlenhoff) [14:42:08] (03CR) 10Dzahn: "haha, that's exactly what i wanted to confirm earlier" [puppet] - 10https://gerrit.wikimedia.org/r/388446 (owner: 10Paladox) [14:43:37] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.3.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/388453 [14:45:18] (03CR) 10Dzahn: [C: 032] Gerrit: fix wrong user syntax for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/388446 (owner: 10Paladox) [14:48:08] (03CR) 10Volans: [C: 032] CHANGELOG: add changelogs for release v1.3.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/388453 (owner: 10Volans) [14:50:30] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.3.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/388453 (owner: 10Volans) [14:50:43] 
(03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v1.3.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/388453 (owner: 10Volans) [14:54:16] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3733240 (10Joe) @phuedx no, getting the headers while you are downloading would not be enough, you would need to supply your s... [15:01:18] 10Operations, 10MediaWiki-Containers, 10Continuous-Integration-Infrastructure (shipyard): Create UI for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696#3733282 (10Addshore) [15:01:34] 10Operations, 10MediaWiki-Containers, 10Continuous-Integration-Infrastructure (shipyard): UI for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696#3733294 (10Addshore) [15:07:05] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3733326 (10Joe) Please note that my biggest concern here is the security one. Citing myself: > * How do we ensure security u... 
[15:07:32] (03PS1) 10Volans: Upstream release v1.3.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/388456 [15:12:07] (03CR) 10Volans: [C: 032] Upstream release v1.3.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/388456 (owner: 10Volans) [15:14:42] (03Merged) 10jenkins-bot: Upstream release v1.3.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/388456 (owner: 10Volans) [15:15:59] (03CR) 10jenkins-bot: Upstream release v1.3.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/388456 (owner: 10Volans) [15:28:11] 10Operations, 10Traffic, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3733373 (10MattFitzpatrick) [15:30:33] (03PS1) 10Muehlenhoff: Restrict HTTP access for racktables [puppet] - 10https://gerrit.wikimedia.org/r/388461 [15:32:00] (03Abandoned) 10Muehlenhoff: Restrict HTTP access for racktables [puppet] - 10https://gerrit.wikimedia.org/r/366521 (owner: 10Muehlenhoff) [15:33:57] (03PS2) 10Ayounsi: Netbox scap3 initial commit [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/387861 [15:35:21] !log uploaded cumin_1.3.0-1_amd64.deb to apt.wikimedia.org jessie-wikimedia [15:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:21] (03PS1) 10Ayounsi: Scap: add codfw host to the list of librenms hosts [puppet] - 10https://gerrit.wikimedia.org/r/388462 [15:42:58] (03PS2) 10Alexandros Kosiaris: kubernetes: Enable RBAC in production [puppet] - 10https://gerrit.wikimedia.org/r/388122 (https://phabricator.wikimedia.org/T177393) [15:43:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Enable RBAC in production [puppet] - 10https://gerrit.wikimedia.org/r/388122 (https://phabricator.wikimedia.org/T177393) (owner: 10Alexandros Kosiaris) [15:44:04] (03PS1) 10ArielGlenn: cleanup old dumps on dumps web servers, part one [puppet] - 
10https://gerrit.wikimedia.org/r/388467 (https://phabricator.wikimedia.org/T178893) [15:49:55] !log T177393 enable RBAC for kubernetes in production and staging [15:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:01] T177393: Implement authentication/authorization in Kubernetes clusters - https://phabricator.wikimedia.org/T177393 [15:50:30] !log powering off labvirt1015 to swap CPU (again) [15:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:01] on pt.wikipedia, users are reporting that someone is trying to access their accounts, is anyone looking at it? [15:52:32] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:35] (03PS2) 10Filippo Giunchedi: mx: export metrics from exim4 mainlog [puppet] - 10https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) [15:56:37] (03PS1) 10Filippo Giunchedi: mtail: add test scaffolding [puppet] - 10https://gerrit.wikimedia.org/r/388478 (https://phabricator.wikimedia.org/T179565) [15:59:17] (03PS2) 10ArielGlenn: cleanup old dumps on dumps web servers, part one [puppet] - 10https://gerrit.wikimedia.org/r/388467 (https://phabricator.wikimedia.org/T178893) [16:00:11] PROBLEM - Host labvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:02:09] (03PS1) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [16:02:35] (03CR) 10jerkins-bot: [V: 04-1] cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [16:03:07] !log demon@tin Synchronized php-1.31.0-wmf.6/extensions/ContentTranslation/api/ApiQueryTranslatorStats.php: fix warnings about bad min() calls (duration: 00m 47s) [16:03:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:17] (03PS2) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [16:05:21] RECOVERY - Host labvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [16:09:08] (03PS3) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [16:12:56] (03PS4) 10Ema: cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) [16:13:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3733537 (10Cmjohnson) @chasemp please try again, I replaced the broken CPU. [16:17:41] !log Deploy alter table on db2044 - T174569 [16:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:49] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:21:18] (03Abandoned) 10Chad: Future-proof against changing import semantics in python 3 [software/conftool] - 10https://gerrit.wikimedia.org/r/387280 (owner: 10Chad) [16:21:22] (03Abandoned) 10Chad: Pylint nitpicks: Whitespace/continuation/parens fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/387313 (owner: 10Chad) [16:21:24] (03Abandoned) 10Chad: Add .coverage to .gitignore [software/conftool] - 10https://gerrit.wikimedia.org/r/387309 (owner: 10Chad) [16:21:27] (03Abandoned) 10Chad: py3 compat: rewrite execfile() using exec() [software/conftool] - 10https://gerrit.wikimedia.org/r/387305 (owner: 10Chad) [16:21:31] (03Abandoned) 10Chad: Future-proof python 3 print function [software/conftool] - 10https://gerrit.wikimedia.org/r/387297 (owner: 10Chad) [16:21:37] (03Abandoned) 10Chad: Py3 compat: swap xrange for range [software/conftool] - 10https://gerrit.wikimedia.org/r/387306 
(owner: 10Chad) [16:21:46] (03Abandoned) 10Chad: Fix StringIO import for py2/3 compat [software/conftool] - 10https://gerrit.wikimedia.org/r/387322 (owner: 10Chad) [16:22:02] (03Abandoned) 10Chad: Fix a ton of errors that were making flake8 freak out [software/conftool] - 10https://gerrit.wikimedia.org/r/387279 (owner: 10Chad) [16:22:41] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:24:50] herron: when you get a chance I'd like your opinion on https://gerrit.wikimedia.org/r/#/c/388032/ and https://gerrit.wikimedia.org/r/#/c/388478/ [16:25:45] godog cool will check it out in a little bit [16:27:45] herron: sweet, thanks! [16:30:25] (03CR) 10Thcipriani: [C: 031] "Looks good from the scap side." (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/387861 (owner: 10Ayounsi) [16:33:55] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/8632/" [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [16:50:39] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3733622 (10RobH) [16:50:44] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3733619 (10RobH) 05Open>03Resolved Whoever went ahead and started the steps marked 'non interruptable' and skipped the switch port disable, please do not do... [16:52:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3733624 (10bd808) How to run a stress test: ``` $ ssh labcontrol1001.wikimedia.org $ source <(sudo cat ~root/novaenv.sh) $ nova list --tenant=testlabs | grep labvirt1015stres... 
[16:52:24] (03PS1) 10Filippo Giunchedi: prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) [16:52:54] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add jobs to scrape metrics from k8s [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [16:54:41] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [16:55:40] (03PS3) 10ArielGlenn: cleanup old dumps on dumps web servers, part one [puppet] - 10https://gerrit.wikimedia.org/r/388467 (https://phabricator.wikimedia.org/T178893) [17:03:31] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 579 bytes in 0.023 second response time [17:04:23] (03PS1) 10Muehlenhoff: otrs: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388506 [17:04:25] (03PS1) 10Muehlenhoff: role::wikimania_scholarships: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388507 [17:04:27] (03PS1) 10Muehlenhoff: role::microsites::peopleweb: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388508 [17:04:29] (03PS1) 10Muehlenhoff: profile::planet::venus: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388509 [17:06:18] (03PS4) 10ArielGlenn: cleanup old dumps on dumps web servers, part one [puppet] - 10https://gerrit.wikimedia.org/r/388467 (https://phabricator.wikimedia.org/T178893) [17:06:38] (03PS1) 10Muehlenhoff: statistics::sites::pivot: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388510 [17:07:29] (03CR) 10ArielGlenn: [C: 032] cleanup old dumps on dumps web servers, part one [puppet] - 10https://gerrit.wikimedia.org/r/388467 (https://phabricator.wikimedia.org/T178893) (owner: 10ArielGlenn) [17:08:55] (03PS1) 10Muehlenhoff: role::servermon::wmf: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388512 [17:10:15] 
(03PS1) 10Muehlenhoff: etherpad: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388513 [17:12:41] (03PS1) 10ArielGlenn: fix up paths for wiki lists on dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/388514 [17:13:34] (03CR) 10ArielGlenn: [C: 032] fix up paths for wiki lists on dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/388514 (owner: 10ArielGlenn) [17:15:36] (03CR) 10Filippo Giunchedi: [C: 031] cache: send varnish logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/388482 (https://phabricator.wikimedia.org/T63782) (owner: 10Ema) [17:18:37] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3733680 (10RobH) [17:19:35] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3600184 (10RobH) [17:19:46] (03CR) 10Filippo Giunchedi: "Note I haven't tested the configuration so it'll likely need to be adjusted" [puppet] - 10https://gerrit.wikimedia.org/r/388505 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [17:22:51] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3696477 (10greg) >>! In T178570#3733240, @Joe wrote: > I would ask advice to the #release-engineering-team on how to distribut... 
[17:23:29] (03PS3) 10Ayounsi: Netbox scap3 initial commit [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/387861 [17:24:59] (03CR) 10Ayounsi: [V: 032 C: 032] Netbox scap3 initial commit [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/387861 (owner: 10Ayounsi) [17:41:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:43:41] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [17:44:18] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-EventLogging: Requesting Sharvani Haran to be added to researchers group - https://phabricator.wikimedia.org/T179611#3733790 (10Sharvaniharan) @Ottomata access for only "researchers" will be sufficient for now. Thank you for updating the ticket... [17:45:41] cp1045.. [17:46:47] (03CR) 10BryanDavis: Netbox scap3 initial commit (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/387861 (owner: 10Ayounsi) [17:47:02] better ores backend throwing 500s, recovered [17:47:17] ? [17:47:47] (03Abandoned) 10Herron: WIP: add puppet package version paramater to puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [17:48:10] (03PS1) 10Herron: puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) [17:48:22] halfak: there were some 500s returned by Ores causing alarms (the ones above) [17:48:47] halfak: https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X [17:48:49] (03CR) 10jerkins-bot: [V: 04-1] puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [17:49:42] Weird. We're not seeing those in changeprop. [17:49:47] I'll look into the requests. 
Thanks elenah [17:49:50] *elukey [17:49:56] sorry! el[tab] :( [17:51:30] FWIW the oxygen logs say all those 500s came from uri_path /scores/frwiki/damaging/ [17:54:13] Right. I worked with elukey to figure out that actually scoring requests are not degraded. [17:54:23] It's just info requests -- probably from a bot on frwiki. [17:54:36] And these info requests are not formatted correctly for a change we made. [17:54:46] I'll be figuring out how to change the majority of these 500s to 400s [17:55:00] sounds awesome :) [17:55:02] From ORES point of view, there's not a bunch of urgency. [17:55:21] We should be able to get some fixes deployed in the Monday service window. [17:55:38] elukey & bblack, sound reasonable? [17:55:40] (03PS2) 10Herron: puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) [17:55:42] yup [17:55:44] cool [17:55:48] Thanks for the ping/help [17:55:49] :) [17:56:11] it might be weird to have the 50x alarm red though [17:56:23] (03CR) 10jerkins-bot: [V: 04-1] puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [17:56:25] I am afraid that it could mask some other issue [17:56:47] !log demon@tin Synchronized php-1.31.0-wmf.6/includes/libs/rdbms/database/DatabaseMysqlBase.php: logging (duration: 00m 47s) [17:56:49] bblack: too paranoid? :D [17:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:02] I hope that the bot at some point will stop [17:57:04] well, for better or worse (worse, probably), we don't treat cache_misc with quite as much urgency as the same 5xx's on text or upload :) [17:57:37] all right so if Eqiad HTTP 5xx reqs/min [17:57:49] is red and something else other than misc is, trouble :) [17:58:01] elukey, I think it is Sailbot. I could reach out to the maintainer. 
Alternatively I could do an emergency deploy to turn these into 400s [17:58:12] eventually we're going to have to deal with the "SLA" sort of issue, even if less-formally than a real SLA, at the level of monitoring/alerting for all these backends. [17:58:32] halfak: nono all good, better not to risk a friday deployment to ores for this [17:58:32] Possibly related to ORES above: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 [17:58:35] Had a spike of those ^ [17:58:47] (could use a patch to defensively code against that possibility in the MW extension) [17:59:37] cache_text and cache_upload have on the order of 2-6 applications behind each, depending how you count, and they're all generally considered a big problem when they go bad. [18:00:04] cache_misc used to only have what we'd think of as "internal" meta-stuff, but over time its use-cases have grown substantially [18:00:38] elukey, OK agreed. [18:00:45] it now has 33 different applications behind it, and they vary considerably in every dimension (traffic-level, importance/"SLA", mostly-internal-use vs public, etc) [18:01:10] and we haven't really done all we should about that [18:02:22] in the long term one of the upcoming plans is to merge the misc+text clusters too, but then we'll probably also want some metadata to flag the relative importance/severity/alerting/etc of different classes of backends.... [18:02:40] (and differentiate them better in graphing maybe, too) [18:04:51] PROBLEM - puppet last run on db2076 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:05:52] (03PS3) 10Herron: puppet: add puppet_major_version variable [puppet] - 10https://gerrit.wikimedia.org/r/388538 (https://phabricator.wikimedia.org/T178825) [18:06:18] (03PS6) 10Ayounsi: [WIP] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) [18:06:21] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [18:06:49] all right going afk, thanks all for the follow up :) [18:07:46] (03PS1) 10Volans: CHANGELOG: fix formatting. [software/cumin] - 10https://gerrit.wikimedia.org/r/388540 [18:10:09] (03CR) 10Hashar: [C: 031] "Looks like Sam Smith was tweaking that for T171853. But it seems the A/B testing is done or figured out. So most probably beta cluster ca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) (owner: 10Jdlrobson) [18:11:52] (03CR) 10Hashar: "\O/" [puppet] - 10https://gerrit.wikimedia.org/r/385472 (owner: 10Hashar) [18:13:23] (03CR) 10BBlack: [C: 04-1] VCL: log TLS information to VSM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [18:15:46] (03PS4) 10Ema: VCL: log TLS information to VSM [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) [18:17:14] (03CR) 10Ema: VCL: log TLS information to VSM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [18:19:02] (03CR) 10Volans: [C: 032] CHANGELOG: fix formatting. [software/cumin] - 10https://gerrit.wikimedia.org/r/388540 (owner: 10Volans) [18:21:29] (03PS1) 10ArielGlenn: script for cleaning up old dumps on the web servers [puppet] - 10https://gerrit.wikimedia.org/r/388541 (https://phabricator.wikimedia.org/T178893) [18:21:31] (03Merged) 10jenkins-bot: CHANGELOG: fix formatting. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/388540 (owner: 10Volans) [18:21:39] 10Operations, 10MediaWiki-Containers, 10Continuous-Integration-Infrastructure (shipyard): UI for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696#3733282 (10hashar) https://docker-registry.wikimedia.org/v2/_catalog ``` lang=json { "repositories": [ "alpine",... [18:21:47] (03CR) 10jenkins-bot: CHANGELOG: fix formatting. [software/cumin] - 10https://gerrit.wikimedia.org/r/388540 (owner: 10Volans) [18:21:58] (03CR) 10jerkins-bot: [V: 04-1] script for cleaning up old dumps on the web servers [puppet] - 10https://gerrit.wikimedia.org/r/388541 (https://phabricator.wikimedia.org/T178893) (owner: 10ArielGlenn) [18:24:53] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review: wikimedia-jessie & wikimedia-stretch docker images don't have deb-src set for apt.wikimedia.org - https://phabricator.wikimedia.org/T179354#3733948 (10hashar) `apt-get build-dep hhvm` is for the jobs that build the PHP exte... [18:24:55] (03PS2) 10ArielGlenn: script for cleaning up old dumps on the web servers [puppet] - 10https://gerrit.wikimedia.org/r/388541 (https://phabricator.wikimedia.org/T178893) [18:26:53] (03CR) 10Herron: "Looks good to me overall. Is it difficult to adjust counters later down the road? 
Say we wanted to count individual causes for 'rejected " [puppet] - 10https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [18:34:52] RECOVERY - puppet last run on db2076 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:36:21] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [18:37:54] (03PS1) 10Herron: admin: add sharvaniharan to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/388545 (https://phabricator.wikimedia.org/T179611) [18:42:49] (03PS3) 10ArielGlenn: script for cleaning up old dumps on the web servers [puppet] - 10https://gerrit.wikimedia.org/r/388541 (https://phabricator.wikimedia.org/T178893) [18:46:27] 10Operations, 10DBA, 10Support-and-Safety, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3734020 (10jrbs) I'm sorry. I have no real idea how to reach devs for deployment here. This wiki would be really useful to have right n... 
[18:50:41] (03PS3) 10Herron: puppet: add puppet-master.conf to avoid conflict at pkg install time [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) [18:51:49] !log whisper-mass-resize completed for graphite1001.eqiad.wmnet:/var/lib/carbon/whisper/frontend/navtiming (at 01:34 UTC actually) [18:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:13] !log Starting whisper-mass-resize for frontend.navtiming on graphite2001 (T179622) [18:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:19] T179622: Update our Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622 [18:58:04] (03CR) 10Herron: puppet: add puppet-master.conf to avoid conflict at pkg install time (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386696 (https://phabricator.wikimedia.org/T179102) (owner: 10Herron) [18:58:18] (03PS4) 10ArielGlenn: script for cleaning up old dumps on the web servers [puppet] - 10https://gerrit.wikimedia.org/r/388541 (https://phabricator.wikimedia.org/T178893) [19:01:27] (03PS7) 10Ayounsi: [WIP] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) [19:09:41] 10Operations, 10Puppet, 10User-Joe: puppet4: puppet master passenger apache backend config changes - https://phabricator.wikimedia.org/T179720#3734093 (10herron) [19:13:34] 10Operations, 10Puppet, 10User-Joe: puppet4: The following unknown setting(s) are being ignored: parser - https://phabricator.wikimedia.org/T179721#3734109 (10herron) [19:14:10] (03PS8) 10Ayounsi: [WIP] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) [19:18:01] 10Operations, 10DBA, 10Support-and-Safety, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3734131 (10Reedy) a:03Reedy >>! 
In T174370#3734020, @jrbs wrote: > I'm sorry. I have no real idea how to reach devs for deployment he... [19:21:39] 10Operations, 10DBA, 10Support-and-Safety, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3734134 (10jrbs) >>! In T174370#3734131, @Reedy wrote: >>>! In T174370#3734020, @jrbs wrote: >> I'm sorry. I have no real idea how to r... [19:21:44] 10Operations, 10Puppet, 10User-Joe: puppet4: puppet master auth.conf changes - https://phabricator.wikimedia.org/T179722#3734135 (10herron) [19:26:21] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [19:32:21] 10Operations, 10Puppet, 10User-Joe: puppet4: conditionally pin puppet* packages to the appropriate repo for OS release - https://phabricator.wikimedia.org/T179724#3734181 (10herron) [19:34:12] (03PS1) 10BryanDavis: wikireplicas: Add partial index for page_props.pp_value [puppet] - 10https://gerrit.wikimedia.org/r/388572 (https://phabricator.wikimedia.org/T140609) [19:41:07] (03PS1) 10Ayounsi: Init submodules [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/388574 [19:41:18] (03CR) 10Ayounsi: [V: 032 C: 032] Init submodules [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/388574 (owner: 10Ayounsi) [19:42:22] PROBLEM - Apache HTTP on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:02] (03PS5) 10ArielGlenn: script for cleaning up old dumps on the web servers [puppet] - 10https://gerrit.wikimedia.org/r/388541 (https://phabricator.wikimedia.org/T178893) [19:43:21] RECOVERY - Apache HTTP on mw2123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.123 second response time [19:45:03] 10Operations, 10Puppet, 10Patch-For-Review: Granular puppet version installation - https://phabricator.wikimedia.org/T178825#3734211 (10herron) [19:45:05] 10Operations, 10Puppet, 10Patch-For-Review: Puppet4: Create empty/placeholder 
/etc/apache2/sites-enabled/puppet-master.conf - https://phabricator.wikimedia.org/T179102#3734210 (herron)
[19:45:40] Operations, Puppet, Patch-For-Review: Granular puppet version installation - https://phabricator.wikimedia.org/T178825#3704234 (herron)
[19:45:42] Operations, Puppet, Patch-For-Review: Puppet4: Create empty/placeholder /etc/apache2/sites-enabled/puppet-master.conf - https://phabricator.wikimedia.org/T179102#3713307 (herron)
[19:46:07] Operations, Puppet, User-Joe: puppet4: puppet master passenger apache backend config changes - https://phabricator.wikimedia.org/T179720#3734215 (herron)
[19:46:09] Operations, Puppet, Patch-For-Review: Granular puppet version installation - https://phabricator.wikimedia.org/T178825#3704234 (herron)
[19:46:27] Operations, Puppet, Patch-For-Review: Granular puppet version installation - https://phabricator.wikimedia.org/T178825#3704234 (herron)
[19:46:30] Operations, Puppet, User-Joe: puppet4: puppet master auth.conf changes - https://phabricator.wikimedia.org/T179722#3734217 (herron)
[19:46:45] Operations, Puppet, Patch-For-Review: Granular puppet version installation - https://phabricator.wikimedia.org/T178825#3704234 (herron)
[19:46:47] Operations, Puppet, User-Joe: puppet4: conditionally pin puppet* packages to the appropriate repo for OS release - https://phabricator.wikimedia.org/T179724#3734219 (herron)
[19:47:54] Operations, Puppet, Patch-For-Review: Granular puppet version selection - https://phabricator.wikimedia.org/T178825#3734222 (herron)
[19:56:21] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[19:57:16] (PS1) Ayounsi: Update checks scripts to reflect netbox location
[software/netbox-deploy] - https://gerrit.wikimedia.org/r/388575
[19:57:37] (CR) Ayounsi: [V: 2 C: 2] Update checks scripts to reflect netbox location [software/netbox-deploy] - https://gerrit.wikimedia.org/r/388575 (owner: Ayounsi)
[20:00:34] (PS9) Ayounsi: [WIP] Puppetize Netbox [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144)
[20:03:11] PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[20:03:12] ACKNOWLEDGEMENT - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T179727
[20:03:15] Operations, ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T179727#3734260 (ops-monitoring-bot)
[20:06:21] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[20:21:05] Operations, Performance-Team: Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3734285 (Krinkle)
[20:25:29] Operations, Performance-Team: Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3734307 (Krinkle)
[20:25:46] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog - https://phabricator.wikimedia.org/T179729#3734308 (Krinkle)
[20:25:56] Operations, Performance-Team: Create perf-team shell group - https://phabricator.wikimedia.org/T179728#3734285 (Krinkle)
[20:25:58] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog - https://phabricator.wikimedia.org/T179729#3734325 (Krinkle)
[20:27:11] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog - https://phabricator.wikimedia.org/T179729#3734308 (Krinkle)
[20:28:18] !log Deploy alter table db2058 - T174569
[20:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:24] T174569: Schema change for refactored comment storage
- https://phabricator.wikimedia.org/T174569
[20:30:00] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog - https://phabricator.wikimedia.org/T179729#3734335 (Krinkle)
[20:30:07] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog - https://phabricator.wikimedia.org/T179729#3734308 (Krinkle)
[20:30:24] Operations, Ops-Access-Requests: Requesting access to perf-teams for phedenskog - https://phabricator.wikimedia.org/T179729#3734308 (Krinkle) I approve, naturally :-)
[20:33:56] (CR) ArielGlenn: [C: 2] script for cleaning up old dumps on the web servers [puppet] - https://gerrit.wikimedia.org/r/388541 (https://phabricator.wikimedia.org/T178893) (owner: ArielGlenn)
[20:36:21] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[20:41:01] PROBLEM - Check systemd state on ms1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:46:01] RECOVERY - Check systemd state on ms1001 is OK: OK - running: The system is fully operational
[20:53:11] PROBLEM - Squid on install1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:56:55] (CR) jenkins-bot: CHANGELOG: fix formatting.
[software/cumin] - https://gerrit.wikimedia.org/r/388540 (owner: Volans)
[20:57:02] RECOVERY - Squid on install1002 is OK: TCP OK - 0.011 second response time on 208.80.154.22 port 8080
[21:26:21] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:12] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 76556 bytes in 0.576 second response time
[21:30:58] (PS1) Ayounsi: Remove old migrate line [software/netbox-deploy] - https://gerrit.wikimedia.org/r/388622
[21:31:11] (CR) Ayounsi: [V: 2 C: 2] Remove old migrate line [software/netbox-deploy] - https://gerrit.wikimedia.org/r/388622 (owner: Ayounsi)
[21:40:51] (PS1) ArielGlenn: enable rsyncs to dataset1001 from dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/388644
[21:42:23] (CR) ArielGlenn: [C: 2] enable rsyncs to dataset1001 from dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/388644 (owner: ArielGlenn)
[21:53:21] PROBLEM - HHVM rendering on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:54:12] RECOVERY - HHVM rendering on mw2120 is OK: HTTP OK: HTTP/1.1 200 OK - 76550 bytes in 0.320 second response time
[21:57:41] (PS10) Ayounsi: [WIP] Puppetize Netbox [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144)
[22:46:21] RECOVERY - MegaRAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[22:53:54] Operations, Commons, Multimedia, Traffic, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3331800 (Tgr) Preventing WP0 users from doing file patrolling seems like an acceptable level of collateral damage (maybe with a wiki wh...
[23:16:21] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough
[23:35:09] (PS1) Smalyshev: Report 429s to logstash too [puppet] - https://gerrit.wikimedia.org/r/388696 (https://phabricator.wikimedia.org/T178533)
[23:56:03] (PS11) Ayounsi: [WIP] Puppetize Netbox [puppet] - https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144)