[00:05:28] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:33:28] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[00:41:18] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:09:18] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[02:16:59] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.9) (duration: 06m 11s)
[02:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 30 02:21:21 UTC 2017 (duration 4m 22s)
[02:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:48] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:58:48] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[03:00:38] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:05:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1814.10332 Seconds
[03:06:28] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 44.669994 Seconds
[03:21:18] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 820.13 seconds
[03:26:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 253.20 seconds
[03:28:38] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[03:59:50] Operations, Gerrit, Release-Engineering-Team: Enable the git:// protocole on gerrit - https://phabricator.wikimedia.org/T156597#2980855 (demon) Open>declined There's zero reason to do this and just adds complexity we'll have to support.
[04:03:28] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:17:28] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:31:28] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[04:45:28] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[05:14:37] (PS1) Legoktm: toollabs: Install mktorrent [puppet] - https://gerrit.wikimedia.org/r/334962 (https://phabricator.wikimedia.org/T155470)
[06:01:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.876 second response time
[06:02:59] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.807 second response time
[06:03:59] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.613 second response time
[06:06:59] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.867 second response time
[06:07:59] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.308 second response time
[06:09:59] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.307 second response time
[06:25:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.203 second response time
[06:28:59] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.765 second response time
[06:29:59] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.190 second response time
[06:30:59] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.533 second response time
[06:31:39] PROBLEM - Host mw1236 is DOWN: PING CRITICAL - Packet loss = 100%
[06:44:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.159 second response time
[06:45:59] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.748 second response time
[06:48:19] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:53:49] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:57:05] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.190 second response time
[06:58:05] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.805 second response time
[07:03:09] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2980969 (Joe) >>! In T149617#2977429, @Legoktm wrote: > @joe could you also upload the slides...
[07:20:05] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2358
[07:22:44] Operations, DBA, Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1072.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage...
[07:24:05] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.426 second response time
[07:24:54] _joe_: Re: Dynamic configuration - I assume this precludes any form of credentials being dynamic
[07:25:05] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.428 second response time
[07:25:05] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 312777 Threads: 1 Questions: 3097182 Slow queries: 1425 Opens: 2119 Flush tables: 1 Open tables: 561 Queries per second avg: 9.902 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[07:26:09] <_joe_> friendly12345: credentials are configuration, not state. Hence, it's not suited for being set at runtime
[07:26:15] <_joe_> friendly12345: https://commons.wikimedia.org/wiki/File:Integrating_MediaWiki_(and_other_services)_with_dynamic_configuration.webm :)
[07:29:05] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.287 second response time
[07:29:11] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2980984 (Gilles)
[07:29:55] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.254 second response time
[07:30:05] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.781 second response time
[07:30:55] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.388 second response time
[07:31:09] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2980985 (Marostegui) I have upgraded db2012 to 10.0.29-2 (actually done a full upgrade) and the A...
[07:35:37] (CR) Muehlenhoff: [C: 2] Record CVE ID fixed in earlier 4.4.x kernel [debs/linux44] - https://gerrit.wikimedia.org/r/334666 (owner: Muehlenhoff)
[07:35:55] (PS3) Muehlenhoff: Switch app servers in codfw to systemd-timesyncd [puppet] - https://gerrit.wikimedia.org/r/332976 (https://phabricator.wikimedia.org/T150257)
[07:40:22] Operations, Analytics, ChangeProp, EventBus, and 5 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2980990 (Joe) https://commons.wikimedia.org/wiki/File:Asynchronous_processing_on_the_WMF_cluster.pdf is the uploaded file.
[07:46:28] Operations, DBA, Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980992 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1072.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['db1072.eqiad.wmnet']) ```
[07:47:09] Operations, Analytics, ChangeProp, EventBus, and 5 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2980993 (Joe) Open>Resolved
[07:50:43] Operations, DBA, Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980995 (Marostegui) >>! In T156226#2980992, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ``` > ['db1072.eqiad.wmnet'] > ``` > > Of which those **FAILED**: > ``` > set(['db1072....
[07:52:30] (PS1) Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/334973 (https://phabricator.wikimedia.org/T156226)
[07:55:07] Operations, DBA, Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980998 (Marostegui) And from the reimage command output: ``` sudo -E wmf-auto-reimage -p T156226 db1072.eqiad.wmnet START To monitor the full log: tail -F /var/log/wmf-auto-reimage/201701300722_m...
[07:56:28] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/334973 (https://phabricator.wikimedia.org/T156226) (owner: Marostegui)
[07:57:05] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:53] (Merged) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/334973 (https://phabricator.wikimedia.org/T156226) (owner: Marostegui)
[07:58:21] (CR) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/334973 (https://phabricator.wikimedia.org/T156226) (owner: Marostegui)
[08:01:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T156226 (duration: 02m 45s)
[08:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:12] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226
[08:02:20] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2981020 (Joe) >>! In T149617#2980969, @Joe wrote: >>>! In T149617#2977429, @Legoktm wrote: >>...
[08:03:35] Operations, DBA, Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (MoritzMuehlenhoff) I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key and difficult to fix in the current design of w...
[08:03:55] (CR) Muehlenhoff: [C: 2] Switch app servers in codfw to systemd-timesyncd [puppet] - https://gerrit.wikimedia.org/r/332976 (https://phabricator.wikimedia.org/T150257) (owner: Muehlenhoff)
[08:04:53] !log Stop mysql db1073 to use it to clone db1072 - T156226
[08:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:42] !log switched application servers in codfw to systemd-timesyncd
[08:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:43] (CR) Nemo bis: dumps: Modernize design of the index page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/334856 (https://phabricator.wikimedia.org/T155697) (owner: Ladsgroup)
[08:08:46] mw1236.eqiad.wmnet down?
[08:10:17] racadm serveraction powerstatus
[08:10:17] Server power status: OFF
[08:10:18] indeed
[08:10:45] nothing on the SAL afaics
[08:10:54] yeah, there is nothing there
[08:11:24] probably better to depool it explictly
[08:11:32] so it will not be part of scap dsh
[08:12:07] never done that before :)
[08:12:19] doing it now :)
[08:12:26] oh, thanks!
[08:12:33] maybe you can teach me how to?
[08:12:55] sure sure, I was going to copy paste commands
[08:13:01] checked its status
[08:13:01] <3
[08:13:02] confctl --quiet select 'name=mw1236.eqiad.wmnet' get
[08:13:05] then
[08:13:28] Operations, DBA, Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2981031 (Marostegui) >>! In T156226#2981023, @MoritzMuehlenhoff wrote: > I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key an...
[08:13:33] confctl --quiet select 'name=mw1236.eqiad.wmnet' --action set/pooled=inactive
[08:13:42] (maybe without --quiet)
[08:13:47] (so it will log)
[08:14:03] and finally puppet run on tin to make sure that scap's dsh is updated
[08:14:22] no sorry the inactive command is wrong :P
[08:14:24] let me check
[08:14:28] haha :)
[08:14:59] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2981032 (Gilles) Accept headers and Vary: Accept are missing from the current task description. Standardizing entry point to thumb.php (and I guess...
[08:15:27] confctl --quiet select 'name=mw1236.eqiad.wmnet' set/pooled=inactive
[08:15:31] no --action
[08:15:33] bad history
[08:15:33] :)
[08:15:58] noted! :)
[08:15:59] thanks
[08:19:06] !log set mw1236.eqiad.wmnet pooled=inactive because powered off (no mentions on the SAL, still trying to find why)
[08:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:52] (CR) Muehlenhoff: [C: 1] "Looks fine to me" [puppet] - https://gerrit.wikimedia.org/r/334719 (https://phabricator.wikimedia.org/T156529) (owner: Dzahn)
[08:22:14] ah there you go "07:31 PROBLEM - Host mw1236 is DOWN: PING CRITICAL - Packet loss = 100%"
[08:22:19] went down a couple of hours ago
[08:25:05] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[08:25:44] let's see if it powersup
[08:26:27] ERROR: Timeout while waiting for server to perform requested power action.
[08:26:44] opening a phab task..
[08:28:34] thanks :)
[08:30:01] Operations, ops-eqiad: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610#2981044 (elukey)
[08:30:05] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:09] Operations, ops-eqiad: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610#2981057 (elukey) p:Triage>Normal
[08:35:05] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.450 second response time
[08:36:05] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.583 second response time
[08:45:27] !log restarting aqs on aqs100[4567] to pick up NSS updates
[08:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:05] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.479 second response time
[08:52:05] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.605 second response time
[08:54:39] !log installing NSS security updates on kafka and Hadoop clusters
[08:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:12] (PS7) Elukey: Refactor role memcached in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/333880
[08:58:07] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[09:03:34] (PS1) Zhuyifei1999: kubernetesbackend: change absolute kubectl path to '/usr/bin/kubectl' [software/tools-webservice] - https://gerrit.wikimedia.org/r/334978 (https://phabricator.wikimedia.org/T156605)
[09:05:47] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[09:05:51] !log Start slaves from s1 to s7 on dbstore2001 - T156373
[09:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:57] T156373: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373
[09:06:18] !log Upgrade db2012 to 10.0.29-2 (this was done couple of hours ago, but for the record) - T156373
[09:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:07] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.763 second response time
[09:11:28] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782449 (fgiunchedi) WRT deployment strategy note that ATM all thumb accesses after varnish go through our [[https://github.com/wikimedia/operations-...
[09:12:07] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.782 second response time
[09:16:20] (PS2) Elukey: Enable AQS aqs1007-b cassandra instance [puppet] - https://gerrit.wikimedia.org/r/334753 (https://phabricator.wikimedia.org/T155654)
[09:17:07] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.025 second response time
[09:19:07] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.180 second response time
[09:19:28] !log installing tcpdump security updates
[09:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:07] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.924 second response time
[09:25:29] (CR) Elukey: [C: 2] Enable AQS aqs1007-b cassandra instance [puppet] - https://gerrit.wikimedia.org/r/334753 (https://phabricator.wikimedia.org/T155654) (owner: Elukey)
[09:25:51] !log bootstrapping new cassandra instance (aqs1007-b) on AQS - https://gerrit.wikimedia.org/r/#/c/334753/
[09:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:07] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.465 second response time
[09:33:16] (CR) ArielGlenn: [C: 2] Make md5sums.txt files compatible with md5sum --check [dumps] - https://gerrit.wikimedia.org/r/328219 (https://phabricator.wikimedia.org/T69886) (owner: Awight)
[09:34:46] (CR) ArielGlenn: [V: 2 C: 2] Make md5sums.txt files compatible with md5sum --check [dumps] - https://gerrit.wikimedia.org/r/328219 (https://phabricator.wikimedia.org/T69886) (owner: Awight)
[09:36:30] (CR) Hashar: "There is a nice crazy change in modules/confluent/manifests/kafka/mirror/instance.pp which I am not sure what would be the end result:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[09:37:07] !log ariel@tin Starting deploy [dumps/dumps@4a9e952]: proper md5sum format for adds/changes dumps
[09:37:09] !log ariel@tin Finished deploy [dumps/dumps@4a9e952]: proper md5sum format for adds/changes dumps (duration: 00m 02s)
[09:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:07] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[09:42:27] grrr
[09:43:19] (PS1) ArielGlenn: Turn off centralauth table dumps. [puppet] - https://gerrit.wikimedia.org/r/334985 (https://phabricator.wikimedia.org/T153633)
[09:44:07] !log upgrade to thumbor 0.1.33 - T151066
[09:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:12] T151066: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066
[09:47:05] (CR) Hashar: "I have seen it in other changes, I am not a huge fan of adding trailing commas to hash that have a single element. Specially puppet-lint " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[09:48:47] (PS2) Filippo Giunchedi: Revert "Remove broken Thumbor IP throttling from configuration" [puppet] - https://gerrit.wikimedia.org/r/334252 (owner: Gilles)
[09:51:30] (CR) Filippo Giunchedi: [V: 2 C: 2] Revert "Remove broken Thumbor IP throttling from configuration" [puppet] - https://gerrit.wikimedia.org/r/334252 (owner: Gilles)
[09:51:51] ci is backed up, self +2
[09:52:57] PROBLEM - puppet last run on mc1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[09:54:01] (PS1) Marostegui: db-eqiad.php: Repool db1072,db1073 with less load [mediawiki-config] - https://gerrit.wikimedia.org/r/334987 (https://phabricator.wikimedia.org/T156226)
[09:54:24] (CR) Marostegui: [C: -2] "wait for the servers to catch up" [mediawiki-config] - https://gerrit.wikimedia.org/r/334987 (https://phabricator.wikimedia.org/T156226) (owner: Marostegui)
[09:57:23] (PS2) Marostegui: db-eqiad.php: Repool db1073 with less load [mediawiki-config] - https://gerrit.wikimedia.org/r/334987 (https://phabricator.wikimedia.org/T156226)
[09:58:18] (CR) Filippo Giunchedi: [C: 1] Enable Prometheus JMX exporter on Cassandra nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: Eevans)
[09:58:49] _joe_: I had a question for you re: conftool(-data) in ^
[09:58:51] !log upgrade and restart nginx on relforge cluster
[09:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:08] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2981357 (Gilles) I don't think that the existing requirements are to be touched at first. I.e. thumbnails would still be width-based only. Some adap...
[09:59:10] (CR) Marostegui: [C: 2] "db1073 caught up" [mediawiki-config] - https://gerrit.wikimedia.org/r/334987 (https://phabricator.wikimedia.org/T156226) (owner: Marostegui)
[10:00:43] !log upgrade and restart nginx on elasticsearch codfw cluster
[10:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:13] PROBLEM - cassandra-b CQL 10.64.0.237:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.237 and port 9042: Connection refused
[10:01:44] (Merged) jenkins-bot: db-eqiad.php: Repool db1073 with less load [mediawiki-config] - https://gerrit.wikimedia.org/r/334987 (https://phabricator.wikimedia.org/T156226) (owner: Marostegui)
[10:01:53] (CR) jenkins-bot: db-eqiad.php: Repool db1073 with less load [mediawiki-config] - https://gerrit.wikimedia.org/r/334987 (https://phabricator.wikimedia.org/T156226) (owner: Marostegui)
[10:03:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 with less weight - T156226 (duration: 00m 49s)
[10:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:54] <_joe_> godog: looking
[10:03:55] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226
[10:06:23] (CR) Giuseppe Lavagetto: [C: -1] Enable Prometheus JMX exporter on Cassandra nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: Eevans)
[10:07:02] _joe_: thanks!
[10:07:03] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:10:12] silencing aqs1007-b! Sorry for the delay
[10:10:53] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[10:11:55] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.237:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.237 and port 9042: Connection refused Elukey bootstrapping cassandra
[10:17:52] (CR) Ema: [V: 2 C: 2] etcd.py: log a warning on empty responses from etcd [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/334369 (https://phabricator.wikimedia.org/T134893) (owner: Ema)
[10:18:56] (PS2) Addshore: Add twocolconflict to wgBetaFeaturesWhitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184)
[10:19:06] James_F: ^^
[10:21:03] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[10:21:18] (CR) Jforrester: ">" [mediawiki-config] - https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: Addshore)
[10:21:28] addshore: Sorry.
[10:21:50] James_F: thats what we did for the RevisionSlider ;)
[10:22:02] we were simply aiming for the same :P
[10:22:11] Operations, Performance-Team, Thumbor: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851#2981440 (Gilles)
[10:22:14] addshore: Yeah, it was wrong then and it's wrong now.
[10:23:10] Operations, Performance-Team, Thumbor: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851#2829906 (Gilles) @fgiunchedi please open read rights for me on the nginx access logs on thumbor100* This way I can look for some kind of pattern in the requests around the time th...
[10:23:37] James_F: ack, will check it then!
[10:24:15] Sorry, didn't see the gerrit reply from last month until now otherwise I'd have replied then.
[10:24:20] Don't want to mess you around.
[10:24:44] (PS2) Ema: Use caller function module name as default log prefix [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/334567
[10:27:13] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.326 second response time
[10:28:13] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.972 second response time
[10:28:38] (PS1) Urbanecm: Create Wikiprojekti namespace on fiwiki and enable VE in it [mediawiki-config] - https://gerrit.wikimedia.org/r/334997 (https://phabricator.wikimedia.org/T156621)
[10:32:26] (CR) Ema: Use caller function module name as default log prefix (2 comments) [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/334567 (owner: Ema)
[10:32:45] (CR) Jforrester: [C: 1] Create Wikiprojekti namespace on fiwiki and enable VE in it [mediawiki-config] - https://gerrit.wikimedia.org/r/334997 (https://phabricator.wikimedia.org/T156621) (owner: Urbanecm)
[10:33:04] (CR) Jforrester: "(Otherwise please consider this a +1 from me.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: Addshore)
[10:33:40] (CR) Addshore: "Brilliant!" [mediawiki-config] - https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: Addshore)
[10:36:03] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[10:40:34] (PS2) Ema: cache: remove varnish_version4 from hiera and salt [puppet] - https://gerrit.wikimedia.org/r/334043
[10:41:19] (CR) Ema: [V: 2 C: 2] cache: remove varnish_version4 from hiera and salt [puppet] - https://gerrit.wikimedia.org/r/334043 (owner: Ema)
[10:50:12] (CR) Gehel: [C: 1] "LGTM, trivial enough change..." [puppet] - https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[10:56:07] (CR) Muehlenhoff: "Two additional comments. Looking at the rules file, it can be reduced to a few lines by moving to dh, this would simplify things a lot. Se" (2 comments) [debs/gerrit] - https://gerrit.wikimedia.org/r/333475 (owner: Paladox)
[10:56:25] (PS2) ArielGlenn: Turn off centralauth table dumps. [puppet] - https://gerrit.wikimedia.org/r/334985 (https://phabricator.wikimedia.org/T153633)
[10:57:49] (CR) ArielGlenn: [C: 2] Turn off centralauth table dumps. [puppet] - https://gerrit.wikimedia.org/r/334985 (https://phabricator.wikimedia.org/T153633) (owner: ArielGlenn)
[11:08:57] (PS6) Juniorsys: Linting fixes (Multiple modules) [puppet] - https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645)
[11:10:11] (PS5) Juniorsys: librenms/locales/logstash/lshell linting changes [puppet] - https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645)
[11:11:13] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.787 second response time
[11:12:13] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.135 second response time
[11:12:21] (PS5) Juniorsys: deployment: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334278 (https://phabricator.wikimedia.org/T93645)
[11:20:27] (CR) Hashar: [C: 1] librenms/locales/logstash/lshell linting changes [puppet] - https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[11:23:59] !log upgrade and restart nginx on elasticsearch eqiad cluster
[11:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:21] Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Patch-For-Review, and 2 others: Create an etcd cluster in codfw - https://phabricator.wikimedia.org/T156009#2981585 (Joe) The cluster in codfw is installed and tested to work correctly with conftool. The performance of the cluster using nginx as a...
[11:32:10] Operations, Gerrit, Release-Engineering-Team: Enable the git:// protocole on gerrit - https://phabricator.wikimedia.org/T156597#2981610 (Aklapper) > Theres no steps as this is a feature request. For future reference, please always provide **reasons why** you request something.
[11:35:33] PROBLEM - Elasticsearch HTTPS on elastic1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:37:03] PROBLEM - Elasticsearch HTTPS on elastic1019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:38:43] PROBLEM - Elasticsearch HTTPS on elastic1020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:39:03] RECOVERY - Elasticsearch HTTPS on elastic1019 is OK: SSL OK - Certificate elastic1019.eqiad.wmnet valid until 2021-03-15 20:20:18 +0000 (expires in 1505 days)
[11:39:43] PROBLEM - Elasticsearch HTTPS on elastic1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:40:22] (PS1) Muehlenhoff: More email addresses [puppet] - https://gerrit.wikimedia.org/r/335005
[11:41:18] (PS8) Giuseppe Lavagetto: Generalize entities definitions [software/conftool] - https://gerrit.wikimedia.org/r/288609 (https://phabricator.wikimedia.org/T155823)
[11:41:20] (PS6) Giuseppe Lavagetto: Add schema support [software/conftool] - https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823)
[11:43:03] PROBLEM - Elasticsearch HTTPS on elastic1023 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:44:03] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:44:33] RECOVERY - Elasticsearch HTTPS on elastic1018 is OK: SSL OK - Certificate elastic1018.eqiad.wmnet valid until 2021-03-15 20:19:15 +0000 (expires in 1505 days)
[11:44:47] (PS9) Giuseppe Lavagetto: Generalize entities definitions [software/conftool] - https://gerrit.wikimedia.org/r/288609 (https://phabricator.wikimedia.org/T155823)
[11:44:49] (PS7) Giuseppe Lavagetto: Add schema support [software/conftool] - https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823)
[11:45:02] (CR) Giuseppe Lavagetto: Generalize entities definitions (10 comments) [software/conftool] - https://gerrit.wikimedia.org/r/288609 (https://phabricator.wikimedia.org/T155823) (owner: Giuseppe Lavagetto)
[11:45:36] (CR) Muehlenhoff: [C: 2] More email addresses [puppet] - https://gerrit.wikimedia.org/r/335005 (owner: Muehlenhoff)
[11:45:43] PROBLEM - Elasticsearch HTTPS on elastic1025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:49:43] PROBLEM - Elasticsearch HTTPS on elastic1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:50:17] (PS1) Elukey: Add aqs1008-a to the AQS Cassandra cluster [puppet] - https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654)
[11:50:54] (PS1) ArielGlenn: sample uwsgi app that would produce json status output for dumps [dumps/statusapi] - https://gerrit.wikimedia.org/r/335007 (https://phabricator.wikimedia.org/T147177)
[11:51:10] Operations, DC-Switchover-Prep-Q3-2016-17, Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#2981625 (akosiaris) Turns out the CPU increase mentioned above is not the result of some bug or otherwise malfunction/chan...
[11:51:14] (03PS2) 10Elukey: Add aqs1008-a to the AQS Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654) [11:51:33] PROBLEM - Elasticsearch HTTPS on elastic1029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:51:50] <_joe_> uh? [11:52:01] <_joe_> oh ok [11:52:03] RECOVERY - Elasticsearch HTTPS on elastic1023 is OK: SSL OK - Certificate elastic1023.eqiad.wmnet valid until 2021-03-15 20:24:40 +0000 (expires in 1505 days) [11:52:43] PROBLEM - Elasticsearch HTTPS on elastic1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:53:43] RECOVERY - Elasticsearch HTTPS on elastic1020 is OK: SSL OK - Certificate elastic1020.eqiad.wmnet valid until 2021-03-15 20:21:21 +0000 (expires in 1505 days) [11:53:53] PROBLEM - Elasticsearch HTTPS on elastic1031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:54:43] PROBLEM - Elasticsearch HTTPS on elastic1032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:54:43] RECOVERY - Elasticsearch HTTPS on elastic1021 is OK: SSL OK - Certificate elastic1021.eqiad.wmnet valid until 2021-08-31 15:30:11 +0000 (expires in 1674 days) [11:54:44] one user. meh [11:56:43] PROBLEM - Elasticsearch HTTPS on elastic1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:57:43] PROBLEM - Elasticsearch HTTPS on elastic1034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:59:03] PROBLEM - Elasticsearch HTTPS on elastic1035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:59:43] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:00:33] PROBLEM - Elasticsearch HTTPS on elastic1036 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:00:43] RECOVERY - Elasticsearch HTTPS on elastic1028 is OK: SSL OK - Certificate elastic1028.eqiad.wmnet valid until 2021-08-31 16:12:42 +0000 (expires in 1674 days) [12:01:53] PROBLEM - Elasticsearch HTTPS on elastic1037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:01:53] RECOVERY - Elasticsearch HTTPS on elastic1031 is OK: SSL OK - Certificate elastic1031.eqiad.wmnet valid until 2021-03-15 20:33:49 +0000 (expires in 1505 days) [12:02:42] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/5276/aqs1008.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [12:05:43] RECOVERY - Elasticsearch HTTPS on elastic1033 is OK: SSL OK - Certificate elastic1033.eqiad.wmnet valid until 2021-06-21 12:10:36 +0000 (expires in 1603 days) [12:05:52] 06Operations: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804#2981638 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:05:53] PROBLEM - Elasticsearch HTTPS on elastic1040 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:05:53] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1046.eqiad.wmnet because of too many down! 
[12:06:33] RECOVERY - Elasticsearch HTTPS on elastic1029 is OK: SSL OK - Certificate elastic1029.eqiad.wmnet valid until 2021-08-31 18:02:18 +0000 (expires in 1674 days) [12:06:43] RECOVERY - Elasticsearch HTTPS on elastic1025 is OK: SSL OK - Certificate elastic1025.eqiad.wmnet valid until 2021-03-15 20:26:54 +0000 (expires in 1505 days) [12:06:43] RECOVERY - Elasticsearch HTTPS on elastic1030 is OK: SSL OK - Certificate elastic1030.eqiad.wmnet valid until 2021-03-15 20:32:44 +0000 (expires in 1505 days) [12:07:43] PROBLEM - Elasticsearch HTTPS on elastic1041 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:07:51] (03PS1) 10Muehlenhoff: Remove otto from piwik-roots [puppet] - 10https://gerrit.wikimedia.org/r/335010 (https://phabricator.wikimedia.org/T142836) [12:07:53] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [12:08:12] * gehel is looking at elasticsearch ... [12:08:33] RECOVERY - Elasticsearch HTTPS on elastic1036 is OK: SSL OK - Certificate elastic1036.eqiad.wmnet valid until 2021-06-21 12:10:51 +0000 (expires in 1603 days) [12:08:43] RECOVERY - Elasticsearch HTTPS on elastic1034 is OK: SSL OK - Certificate elastic1034.eqiad.wmnet valid until 2021-06-21 12:10:41 +0000 (expires in 1603 days) [12:08:43] RECOVERY - Elasticsearch HTTPS on elastic1032 is OK: SSL OK - Certificate elastic1032.eqiad.wmnet valid until 2021-06-21 08:40:25 +0000 (expires in 1602 days) [12:08:43] RECOVERY - Elasticsearch HTTPS on elastic1041 is OK: SSL OK - Certificate elastic1041.eqiad.wmnet valid until 2021-06-21 13:36:01 +0000 (expires in 1603 days) [12:08:53] RECOVERY - Elasticsearch HTTPS on elastic1037 is OK: SSL OK - Certificate elastic1037.eqiad.wmnet valid until 2021-06-21 12:10:56 +0000 (expires in 1603 days) [12:08:53] RECOVERY - Elasticsearch HTTPS on elastic1040 is OK: SSL OK - Certificate elastic1040.eqiad.wmnet valid until 2021-06-21 13:35:56 +0000 (expires in 1603 days) [12:09:03] RECOVERY - 
Elasticsearch HTTPS on elastic1035 is OK: SSL OK - Certificate elastic1035.eqiad.wmnet valid until 2021-06-21 12:10:46 +0000 (expires in 1603 days) [12:09:15] (03PS2) 10Alexandros Kosiaris: ores::redis: Enable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/334663 [12:11:03] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:11:04] (03PS1) 10Muehlenhoff: Remove aaron from logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/335011 (https://phabricator.wikimedia.org/T142836) [12:13:22] (03PS1) 10Muehlenhoff: Remove otto from aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/335012 (https://phabricator.wikimedia.org/T142836) [12:18:11] (03PS1) 10Muehlenhoff: Remove elukey from analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/335013 (https://phabricator.wikimedia.org/T142836) [12:20:27] (03PS1) 10Muehlenhoff: Remove aaron from contint-users [puppet] - 10https://gerrit.wikimedia.org/r/335014 (https://phabricator.wikimedia.org/T142836) [12:21:16] 06Operations, 10hardware-requests: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#2981648 (10akosiaris) [12:21:30] (03CR) 10Elukey: [C: 032] Remove elukey from analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/335013 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [12:21:35] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#2972283 (10akosiaris) [12:21:37] 06Operations, 10hardware-requests: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#2981648 (10akosiaris) [12:22:00] 06Operations, 10hardware-requests: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#2981648 (10akosiaris) p:05Triage>03High [12:26:02] (03CR) 10Alexandros Kosiaris: [C: 
032] "https://puppet-compiler.wmflabs.org/5277/ NOOP, merging" [puppet] - 10https://gerrit.wikimedia.org/r/334663 (owner: 10Alexandros Kosiaris) [12:26:09] (03PS3) 10Alexandros Kosiaris: ores::redis: Enable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/334663 [12:26:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores::redis: Enable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/334663 (owner: 10Alexandros Kosiaris) [12:27:43] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:28:02] <_joe_> thanks akosiaris :) [12:28:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Er, wrong paste. https://puppet-compiler.wmflabs.org/5278/oresrdb1001.eqiad.wmnet/ say's it's fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/334663 (owner: 10Alexandros Kosiaris) [12:37:32] 06Operations, 10ops-codfw, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2981676 (10elukey) I quickly tested salt/puppet to help out and and everything seems working as expected except mc2033 and mc2034 (powered down?). [12:38:30] hashar: as usual on Mondays, today's eu swat is full https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0January.C2.A030 [12:43:13] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:44:47] 06Operations, 10hardware-requests: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#2981696 (10mark) @RobH: let's get this out ASAP, we need to have this in place before the switchover, start of April. 
[12:54:30] zeljkof: yeah they look quite straightforward [12:54:32] 06Operations, 13Patch-For-Review: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#2981702 (10MoritzMuehlenhoff) "volans" and "elukey" have been removed from the nda group since they're WMF staff and present in the "wmf" group already. [12:55:13] hashar: except the last one [12:55:30] it links to commit in core in master, already merged o.O [12:58:03] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:09:20] (03CR) 10Hashar: [C: 04-1] "There are a few similar pending requests all tracked from T51357. I would rather NOT land this until it is reviewed/confirmed by someone " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334787 (https://phabricator.wikimedia.org/T155892) (owner: 10Urbanecm) [13:10:45] (03CR) 10Hashar: [C: 031] Remove flaggedrevs-protect-review page protection from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334511 (https://phabricator.wikimedia.org/T156448) (owner: 10Urbanecm) [13:11:34] (03CR) 10Hashar: [C: 031] Enable SandboxLink on gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [13:11:38] (03CR) 10Hashar: [C: 031] Enable SandboxLink on tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334551 (https://phabricator.wikimedia.org/T156473) (owner: 10Urbanecm) [13:12:13] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:14:22] (03PS4) 10Hashar: Enable expiring user groups on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333652 (owner: 10TTO) [13:14:51] (03CR) 10Hashar: [C: 032] "You are more than welcome to add beta cluster only changes to the SWAT, though most of the time we just merge them on request :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333652 
(owner: 10TTO) [13:15:15] hashar: Thanks, I was never quite sure what to do with beta cluster only ones :) [13:16:02] There's a list of inclusion criteria but no list of "what you don't need to use SWAT for" [13:16:19] (03Merged) 10jenkins-bot: Enable expiring user groups on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333652 (owner: 10TTO) [13:16:22] and "who to ping about things you don't need to use SWAT for" [13:17:52] (03CR) 10jenkins-bot: Enable expiring user groups on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333652 (owner: 10TTO) [13:18:51] tto: well you can add beta cluster config changes to SWAT for sure :D [13:19:03] tto: or just poke deployers to land them at anytime :] [13:19:27] if you add it to swat, you have a guarantee it will land without further action [13:19:33] so it is all fine :] [13:19:46] I'll remember that for future. Especially as first SWAT of the week is always close to full; others will be happier if unnecessary changes are done outside SWAT :) [13:26:13] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:33:08] (03PS4) 10ArielGlenn: dumps: Modernize design of the index page [puppet] - 10https://gerrit.wikimedia.org/r/334856 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [13:35:45] (03CR) 10ArielGlenn: "Because this file is processed as a template with % substitutions, lone % in css properties must be replaced with %%; I've done so in this" [puppet] - 10https://gerrit.wikimedia.org/r/334856 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [13:40:32] (03PS2) 10Muehlenhoff: Remove elukey from analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/335013 (https://phabricator.wikimedia.org/T142836) [13:47:43] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:52:33] RECOVERY - MariaDB Slave Lag: m3 on db2012 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [13:53:33] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [13:54:31] (03PS1) 10Gehel: maps - increase replication frequency to each minute on maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/335022 (https://phabricator.wikimedia.org/T145534) [13:56:04] (03PS2) 10Gehel: maps - increase replication frequency to each minute on maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/335022 (https://phabricator.wikimedia.org/T145534) [13:57:40] (03CR) 10Gehel: [C: 032] maps - increase replication frequency to each minute on maps-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/335022 (https://phabricator.wikimedia.org/T145534) (owner: 10Gehel) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T1400). Please do the needful. [14:00:04] Urbanecm, tto, and MatmaRex: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:06] (03CR) 10Filippo Giunchedi: [C: 031] Add aqs1008-a to the AQS Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [14:00:39] hashar: want to do swat today? I am in the middle of webdriverio patch... [14:01:04] zeljkof: will do [14:01:11] hashar: thanks! 
[14:01:26] I'm around if you need help ;) [14:01:33] me too ;) [14:01:41] (03PS2) 10Hashar: Enable SandboxLink on gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [14:01:43] (03PS2) 10Hashar: Enable SandboxLink on tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334551 (https://phabricator.wikimedia.org/T156473) (owner: 10Urbanecm) [14:01:46] doing the sandboxlinks for Urbanecm [14:02:01] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [14:02:03] (03CR) 10Hashar: [C: 031] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334551 (https://phabricator.wikimedia.org/T156473) (owner: 10Urbanecm) [14:02:54] zeljkof, late but around today :) [14:03:37] (03Merged) 10jenkins-bot: Enable SandboxLink on gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [14:03:40] zeljkof, late but around today :) [14:03:48] (03CR) 10jenkins-bot: Enable SandboxLink on gdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334157 (https://phabricator.wikimedia.org/T156281) (owner: 10Urbanecm) [14:03:53] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [14:04:29] (03CR) 10Hashar: [C: 032] Enable SandboxLink on tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334551 (https://phabricator.wikimedia.org/T156473) (owner: 10Urbanecm) [14:04:47] I'm a bit confused. Who is SWAtter for today? zeljkof? Or hashar? Or somebody else? [14:04:50] Urbanecm: Enable SandboxLink on gdwiki is enabled [14:04:57] Urbanecm: I will do it [14:05:03] At prod? 
[14:05:11] hashar, ok [14:05:27] Urbanecm: it's hashar [14:05:45] zeljkof, thank you too :) [14:05:51] (03PS1) 10Marostegui: db-eqiad.php: Restore db1073 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335023 (https://phabricator.wikimedia.org/T156226) [14:06:09] (03Merged) 10jenkins-bot: Enable SandboxLink on tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334551 (https://phabricator.wikimedia.org/T156473) (owner: 10Urbanecm) [14:06:17] (03CR) 10jenkins-bot: Enable SandboxLink on tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334551 (https://phabricator.wikimedia.org/T156473) (owner: 10Urbanecm) [14:06:29] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable SandboxLink on gdwiki - T156281 (duration: 00m 48s) [14:06:32] (03PS2) 10Hashar: Remove flaggedrevs-protect-review page protection from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334511 (https://phabricator.wikimedia.org/T156448) (owner: 10Urbanecm) [14:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:34] T156281: Enable SandboxLink on gdwiki - https://phabricator.wikimedia.org/T156281 [14:06:44] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334511 (https://phabricator.wikimedia.org/T156448) (owner: 10Urbanecm) [14:07:05] hashar, gdwiki working now (I thought it was already enabled not you're going to enable it) [14:07:21] tgwiki as well [14:07:28] thx [14:08:10] (03Merged) 10jenkins-bot: Remove flaggedrevs-protect-review page protection from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334511 (https://phabricator.wikimedia.org/T156448) (owner: 10Urbanecm) [14:08:13] the enwiki flaggedrevs / PC2 change I have no idea how to test it :( [14:08:25] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable SandboxLink on tgwiki - T156473 (duration: 00m 40s) [14:08:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:29] T156473: Enable SandboxLink on tg.wikipedia - https://phabricator.wikimedia.org/T156473 [14:08:40] (03CR) 10Elukey: "Thanks! Will wait for aqs1007-b to be fully boostrapped before proceeding :)" [puppet] - 10https://gerrit.wikimedia.org/r/335006 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [14:08:41] hashar, ask local sysops? [14:09:02] I am going to just push it [14:09:21] Okay, let's push it and I'll ask for confirmation in the task. Do you agree? [14:09:29] yup [14:09:35] Ok [14:09:44] the Sandbox link, I believe we should just enable it on all wiki [14:09:45] s [14:09:51] (03CR) 10jenkins-bot: Remove flaggedrevs-protect-review page protection from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334511 (https://phabricator.wikimedia.org/T156448) (owner: 10Urbanecm) [14:09:53] hashar: I was about to ask some info for the new rake stuff, I promise that I'll try to review the patches today/tomorrow [14:10:15] elukey: sure thing! Poke me any time for more details / if you want a demo / crash course or whatever :] [14:10:30] !log hashar@tin Synchronized wmf-config/flaggedrevs.php: Remove flaggedrevs-protect-review page protection from enwiki - T156448 (duration: 00m 41s) [14:10:30] elukey: I will be more than happy to demo it over a hangout session with screen sharing :] [14:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:34] T156448: Remove flaggedrevs-protect-review (PC2) page protection option from the English Wikipedia - https://phabricator.wikimedia.org/T156448 [14:10:36] Urbanecm: [config] 334787 Increase default thumb size to 250px at nowiki [14:10:37] (03PS1) 10Elukey: Add aqs1007 to AQS's conftool data [puppet] - 10https://gerrit.wikimedia.org/r/335024 (https://phabricator.wikimedia.org/T155654) [14:10:43] hashar, I agree. But shouldn't we notify the communities at least? [14:10:57] Urbanecm: I have skipped that one. 
Not quite sure what are the technical impacts of changing the thumbsize. That is really all a technical debt [14:11:13] Urbanecm: we would probably want to change the thumbsize for all wikis [14:11:29] (03PS2) 10Hashar: Create namespace alias وگ for NS_PROJECT in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334510 (https://phabricator.wikimedia.org/T156451) (owner: 10Urbanecm) [14:11:35] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334510 (https://phabricator.wikimedia.org/T156451) (owner: 10Urbanecm) [14:12:39] hashar, okay, you don't 100% it, you skipped it but the task exist since 2013... [14:12:59] Urbanecm: yeah that is an old topic :/ [14:13:13] (03Merged) 10jenkins-bot: Create namespace alias وگ for NS_PROJECT in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334510 (https://phabricator.wikimedia.org/T156451) (owner: 10Urbanecm) [14:13:28] I missed the "trust" word... [14:13:28] I blocked such updates a few years ago because I could not assert the impact on the Wikimedia thumbnailing infrastructure [14:13:31] (03CR) 10jenkins-bot: Create namespace alias وگ for NS_PROJECT in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334510 (https://phabricator.wikimedia.org/T156451) (owner: 10Urbanecm) [14:13:55] but thumbnailing has changed a lot, so maybe we can just change the default size for everyone. I am not sure really [14:14:21] hashar, okay. Should I mark the task as declined? And/or abandon the change? [14:14:27] keep it open [14:14:31] until the parent is figured out [14:14:35] Ok [14:14:38] potentially you can raise it on wikitech-l [14:14:45] so appropriate people look at the impact [14:14:54] and take a decision as to what should be the default thumbsize [14:15:02] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Create namespace alias وگ for NS_PROJECT in fawikiquote - T156451 (duration: 00m 40s) [14:15:03] Ok, I'll write there. 
[14:15:05] nowayda,s maybe it is not a big deal for the cache infra [14:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:06] T156451: Create namespace alias وگ for NS_PROJECT in fawikiquote - https://phabricator.wikimedia.org/T156451 [14:15:43] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:16:30] um I don't think MatmaRex is here right now [14:22:28] Urbanecm: I quite messed up some pages on fawikiquote :( [14:22:58] I really need to learn farsi [14:23:00] hashar, will it be easy to demess them? [14:23:04] What is farsi? [14:23:11] persian [14:23:14] the language from Iran [14:23:24] Yeah, thanks [14:23:40] Iran is quite a fascinating country and more or less a unique culture in the middle east [14:24:16] well I guess the bulk of the work is done for fawikiquote, we can keep it open [14:24:27] I am subscribed to the task so I can follow up / help if people ask questions [14:24:27] I am quite familiar with geography of our planet. I just thought it is something technical, some way how to fix it :D [14:24:58] (03PS2) 10Hashar: Enable RSS extension at metawiki, enable one feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334864 (https://phabricator.wikimedia.org/T155830) (owner: 10Urbanecm) [14:25:03] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334864 (https://phabricator.wikimedia.org/T155830) (owner: 10Urbanecm) [14:25:05] Hm, I don't understand you. If bulk of work is done why it should be open? Should I do anything? 
[14:25:22] Urbanecm: there are seven pages that conflicted [14:25:49] having the same name in the NS_MAIN and the new NS_PROJECT [14:26:22] (03Merged) 10jenkins-bot: Enable RSS extension at metawiki, enable one feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334864 (https://phabricator.wikimedia.org/T155830) (owner: 10Urbanecm) [14:26:27] Maybe we should just list them in task/paste and ask them for fixing. [14:26:36] I did :] [14:26:39] Good :) [14:26:49] BTW are they accessible? [14:27:48] (03CR) 10jenkins-bot: Enable RSS extension at metawiki, enable one feed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334864 (https://phabricator.wikimedia.org/T155830) (owner: 10Urbanecm) [14:28:27] Ignore my question, I read the task :) [14:29:39] [config] 334864 Enable RSS extension at metawiki, enable one feed [14:29:39] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable RSS extension at metawiki, enable one feed - T155830 (duration: 00m 42s) [14:29:41] that one works [14:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:43] T155830: Enable "Wikimedia DE Policy News Update" RSS feed on meta.wikimedia.org - https://phabricator.wikimedia.org/T155830 [14:29:44] validated it myself [14:29:51] thx [14:30:00] tto: ] 333652 Enable expiring user groups on beta [14:30:10] tto: I have deployed that one an hour or so ago so it is all set :D [14:30:14] Urbanecm: thank you for all those changes! [14:30:21] It is working indeed :) [14:30:29] hashar, thanks for deploying all those changes! [14:31:53] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:33:24] 06Operations, 13Patch-For-Review: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#2981868 (10MoritzMuehlenhoff) "dcausses" and "matmarex" have been removed from the nda group since they're WMF staff and present in the "wmf" group already. 
[14:45:53] !log hashar@tin Synchronized php-1.29.0-wmf.9/languages/Language.php: translateBlockExpiry: Duration is block expiry minus current time - T156453 (duration: 00m 42s) [14:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:57] T156453: BlockLogFormatter formats relative timestamps with duration since Unix epoch - https://phabricator.wikimedia.org/T156453 [14:51:45] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Rate-limit browsers without referers - https://phabricator.wikimedia.org/T154704#2981917 (10Gehel) This is worth discussing with our #traffic team. @BBlack, @ema: what is your point of view on rate-limiting browser without referer? Varn... [14:56:49] (03PS2) 10DCausse: [WIP] Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) [15:01:13] 06Operations, 10ops-codfw, 10DBA: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2981946 (10Marostegui) [15:02:18] (03PS2) 10Elukey: Add aqs1007 to AQS's conftool data [puppet] - 10https://gerrit.wikimedia.org/r/335024 (https://phabricator.wikimedia.org/T155654) [15:21:46] (03CR) 10Elukey: [C: 032] Add aqs1007 to AQS's conftool data [puppet] - 10https://gerrit.wikimedia.org/r/335024 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [15:23:41] (03PS6) 10Rush: Tools: Disable automatic backups of aptly repositories [puppet] - 10https://gerrit.wikimedia.org/r/328031 (https://phabricator.wikimedia.org/T150726) (owner: 10Tim Landscheidt) [15:28:03] PROBLEM - DPKG on cp3020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:31:03] RECOVERY - DPKG on cp3020 is OK: All packages OK [15:33:53] (03CR) 10Rush: [C: 032] Tools: Disable automatic backups of aptly repositories [puppet] - 10https://gerrit.wikimedia.org/r/328031 (https://phabricator.wikimedia.org/T150726) (owner: 10Tim Landscheidt) [15:35:23] 
06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2982093 (10Fjalapeno) Thank you all for pushing this forward! This is really great for the reading platforms where we have had lots of ambiguity with t... [15:37:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1073 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335023 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [15:37:33] (03PS12) 10Paladox: Gerrit: Add a systemd init script fro gerrit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 [15:37:39] (03CR) 10Paladox: [C: 031] Gerrit: Add a systemd init script fro gerrit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 (owner: 10Paladox) [15:38:03] (03CR) 10Paladox: [C: 031] Gerrit: Add a systemd init script fro gerrit (032 comments) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/333475 (owner: 10Paladox) [15:38:25] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1073 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335023 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [15:38:33] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1073 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335023 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [15:39:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 with its original weight - T156226 (duration: 00m 52s) [15:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:59] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [15:42:08] !log hashar@tin Synchronized php-1.29.0-wmf.9/extensions/timeline/Timeline.body.php: debug log EasyTimeline error - T138036 (duration: 00m 46s) [15:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:12] T138036: proc line: 2959: 
warning: points must have either 4 or 2 values per line - https://phabricator.wikimedia.org/T138036 [15:57:15] apergos: hey, should we wait for more feedback or wait some more? tell me when it's okay to merge it [16:00:04] legoktm and arlolra: Respected human, time to deploy Linter deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T1600). Please do the needful. [16:01:16] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982158 (10Papaul) Disk replacement complete on slot 11 [16:01:34] Amir1: which? [16:01:55] apergos: the dumps UI patch [16:02:06] thank you, pap*ul [16:05:04] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982163 (10Marostegui) Thanks - It is getting rebuilt ``` root@db2011:/usr/local/bin# megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 44% in 1... [16:06:29] Amir1: did you see Nemo_bis' comment about the font? Can we use open fonts? [16:06:33] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:06:37] and I would leave it a couple days [16:07:05] but I'll be merging it before the end of the week unless we get a whole bunch of people adding changest to it (pretty unlikely) [16:09:48] 06Operations, 10hardware-requests: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#2982181 (10RobH) a:03RobH I'll create the required sub-tasks today. [16:10:47] apergos: Yup, I'll fix it. Thanks! [16:11:24] yw, thanks for the fix! [16:14:23] (03PS1) 10Marostegui: db-codfw.php: Depool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335048 (https://phabricator.wikimedia.org/T156478) [16:16:34] legoktm can I deploy mediawiki config? 
I saw you have a deployment window now, so I will wait for you guys :) [16:20:17] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2982238 (10Papaul) a:05Marostegui>03Papaul [16:23:13] RECOVERY - MegaRAID on db2011 is OK: OK: optimal, 1 logical, 2 physical [16:23:25] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982242 (10Marostegui) Rebuilt finished successfully ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices... [16:23:41] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982243 (10Marostegui) 05Open>03Resolved a:03Papaul [16:26:43] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:27:55] marostegui: yes go for it [16:28:02] thanks! [16:28:09] I'll start my scap once you're done [16:28:09] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335048 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [16:28:29] thanks, it should take no time :) [16:29:10] (03PS1) 10Legoktm: Add Linter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335049 [16:34:26] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs2003.codfw.wmnet [16:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:33] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:35:15] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. 
- https://phabricator.wikimedia.org/T124627#2982316 (10Gehel) [16:35:18] 06Operations, 10ops-codfw, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs2003 - https://phabricator.wikimedia.org/T152644#2982313 (10Gehel) 05Open>03Resolved wdqs2003 has completed data import, it is now pooled [16:36:13] (03CR) 10Marostegui: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335048 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [16:38:12] marostegui: I think jenkins is just behind... [16:39:24] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335048 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [16:39:32] (03CR) 10jenkins-bot: db-codfw.php: Depool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335048 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [16:40:59] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2034 for maintenance - T156478 (duration: 00m 40s) [16:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:03] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [16:41:03] legoktm: you are good to go :-) [16:41:15] thanks! [16:41:18] (03CR) 10Legoktm: [C: 032] Add Linter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335049 (owner: 10Legoktm) [16:43:49] (03Merged) 10jenkins-bot: Add Linter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335049 (owner: 10Legoktm) [16:44:04] (03CR) 10jenkins-bot: Add Linter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335049 (owner: 10Legoktm) [16:46:27] !log legoktm@tin Started scap: Build l10n cache for linter [16:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:03] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:54:16] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2982398 (10faidon) p:05Unbreak!>03High Looks stable for now, lowering priority. [16:54:43] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:55:31] ostriches: thcipriani: uh, scap is complaining about stuff [16:55:32] 16:55:03 Started cache_git_info [16:55:32] 16:55:03 Unable to find remote tracking branch/tag for /srv/mediawiki-staging/php-1.29.0-wmf.9/extensions/Popups [16:55:39] and then every extension [16:56:05] I assume cache_git_info breaking isn't the end of the world so I'm letting it continue for now [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T1700). Please do the needful. [17:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:04:13] legoktm: yup. That's fixed in master/release/scap release that is ready to go, probably get it out later today. [17:04:32] thcipriani: ok, so it's fine to ignore? [17:04:58] legoktm: yes, please do ignore for now. Thanks for the ping about it though.
[17:05:19] !log Shutdown mysql and poweroff db2034 for maintenance - T156478 [17:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:25] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [17:09:10] !log legoktm@tin Finished scap: Build l10n cache for linter (duration: 22m 43s) [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:48] (I'm done) [17:12:42] (03PS1) 10Legoktm: [WIP] Enable Linter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335052 [17:13:47] 06Operations, 10ops-eqiad, 06DC-Ops, 10hardware-requests: Decommission neptunium - https://phabricator.wikimedia.org/T122101#2982443 (10RobH) [17:13:49] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: Decommission plutonium - https://phabricator.wikimedia.org/T118586#2982444 (10RobH) [17:13:51] 06Operations, 10ops-eqiad, 06DC-Ops, 10hardware-requests: Decommission calcium - https://phabricator.wikimedia.org/T116790#2982446 (10RobH) [17:13:54] 06Operations, 10ops-eqiad, 06DC-Ops, 10hardware-requests, 13Patch-For-Review: Decommission rubidium - https://phabricator.wikimedia.org/T118213#2982445 (10RobH) [17:13:56] 06Operations, 10hardware-requests: eqiad out of warranty spares to decommission - approval request - https://phabricator.wikimedia.org/T120679#2982442 (10RobH) 05Open>03Resolved [17:14:52] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Update db2034 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335053 (https://phabricator.wikimedia.org/T156478) [17:19:41] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 06Services (watching), 15User-mobrovac, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#2982461 (10mobrovac) [17:21:05] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:22:05] 06Operations, 
10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#2982466 (10mobrovac) [17:23:47] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#2982469 (10BBlack) We should probably divorce the RO/RW distinction from the core design here. Not all services... [17:25:48] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2982472 (10fgiunchedi) [17:26:58] (03PS1) 10Papaul: DNS: Change db2034 production dns.Server has been moved from row C to Row A Bug:T156478 [dns] - 10https://gerrit.wikimedia.org/r/335054 [17:29:51] (03CR) 10Marostegui: [C: 031] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/335054 (owner: 10Papaul) [17:41:39] !log update RESTBase to cd2b5e019: staging [17:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:11] PROBLEM - Restbase root url on xenon is CRITICAL: connect to address 10.64.0.200 and port 7231: Connection refused [17:45:11] PROBLEM - restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.200, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd88a82b990: Failed to establish a new connection: [Errno 111] Connection refused,)) [17:45:48] ^^ this is OK, it's expected - new keyspaces are being created [17:46:11] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.028 second response time [17:46:11] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [17:49:01] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: DNS: dynamically generate entries for 
service discovery - https://phabricator.wikimedia.org/T156100#2982595 (10GWicke) > if specific services needs a split into "active/passive RW + active/active RO", we can solv... [17:51:01] PROBLEM - Restbase root url on restbase-dev1001 is CRITICAL: connect to address 10.64.0.35 and port 7231: Connection refused [17:51:12] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.35, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f54b7b5c950: Failed to establish a new connection: [Errno 111] Connection refused,)) [17:52:01] RECOVERY - Restbase root url on restbase-dev1001 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.010 second response time [17:52:11] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [17:52:41] (03CR) 10Tim Landscheidt: [V: 032 C: 031] "I had the same idea while AFK, and my patience saved me work :-); thanks. I tested this successfully by patching /usr/lib/python2.7/dist-" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/334978 (https://phabricator.wikimedia.org/T156605) (owner: 10Zhuyifei1999) [17:53:50] 06Operations, 06Analytics-Kanban: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2982604 (10Milimetric) Thanks, @jcrespo, I didn't see the ping, it looks like Phabricator had some notification issues. The idea with this service is that it wouldn't take time away from ops, s... [17:56:27] Something caused MW error rate to skyrocket ~3 hours ago: https://grafana.wikimedia.org/dashboard/db/production-logging?from=now-4h&to=now [17:57:16] er ~1 hour ago [17:58:30] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982608 (10Papaul) @Robh we about to move db2034 in row c rack C6 to row A rack 5. 
I will like for you please if you have time to make some changes on the both switches .... [17:59:32] twentyafterfour: DBReplication channel looks to be the culprit -- https://logstash.wikimedia.org/goto/1da8da5ef2a75d1505730e15bec9e036 [17:59:57] something is lagging I'd guess [18:00:04] Niharika and bd808: Dear anthropoid, the time has come. Please deploy Wikimania scholarships deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T1800). [18:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T1800). Please do the needful. [18:00:04] gehel: A patch you scheduled for Weekly Wikidata query service deployment window is about to be deployed. Please be available during the process. [18:01:31] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Update db2034 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335053 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [18:02:54] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Update db2034 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335053 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [18:03:04] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Update db2034 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335053 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [18:04:09] !log rolling restart of nginx and wdqs for updates [18:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:14] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2034 IP - T156478 (duration: 00m 40s) [18:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:20] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [18:05:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2034 IP - T156478 (duration: 
00m 40s) [18:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:31] PROBLEM - Check systemd state on wdqs1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:06:35] PROBLEM - DPKG on wdqs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:06:35] PROBLEM - WDQS HTTP Port on wdqs2001 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused [18:06:35] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:06:41] PROBLEM - WDQS HTTP Port on wdqs1002 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused [18:06:42] PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: connect to address 10.192.32.148 and port 80: Connection refused [18:06:51] PROBLEM - WDQS HTTP Port on wdqs2002 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused [18:06:51] PROBLEM - DPKG on wdqs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:06:52] PROBLEM - DPKG on wdqs2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:06:53] PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: connect to address 10.192.48.65 and port 80: Connection refused [18:06:53] PROBLEM - WDQS HTTP on wdqs2002 is CRITICAL: connect to address 10.192.48.65 and port 80: Connection refused [18:07:01] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - wdqs_80 - Could not depool server wdqs2002.codfw.wmnet because of too many down! [18:07:11] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - wdqs_80 - Could not depool server wdqs2002.codfw.wmnet because of too many down! 
[18:07:18] Oops, that's me failing my nginx upgrade on wdqs, I'm on it [18:07:48] !log update RESTBase to cd2b5e019: canary on restbase1007 [18:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:31] RECOVERY - Check systemd state on wdqs1002 is OK: OK - running: The system is fully operational [18:08:32] RECOVERY - DPKG on wdqs1001 is OK: All packages OK [18:08:38] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982644 (10RobH) >>! In T156478#2982608, @Papaul wrote: > @Robh we about to move db2034 in row c rack C6 to row A rack 5. I will like for you please if you have time to m... [18:08:41] RECOVERY - WDQS HTTP Port on wdqs1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80 [18:08:51] RECOVERY - DPKG on wdqs1002 is OK: All packages OK [18:09:31] RECOVERY - WDQS HTTP Port on wdqs2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80 [18:09:32] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational [18:09:41] RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 10479 bytes in 0.073 second response time [18:09:51] RECOVERY - WDQS HTTP Port on wdqs2002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80 [18:09:51] RECOVERY - DPKG on wdqs2001 is OK: All packages OK [18:09:52] RECOVERY - WDQS SPARQL on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 10479 bytes in 0.073 second response time [18:09:53] RECOVERY - WDQS HTTP on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 10479 bytes in 0.073 second response time [18:10:01] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [18:10:13] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by
ProtocolError(Connection aborted., error(111, Connection refused))) [18:10:41] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused [18:11:22] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1001.codfw.wmnet [18:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:51] !log upgrading firejail on scb cluster [18:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:27] !log updated scholarships Fixed some bugs with the login form [18:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:59] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1002.codfw.wmnet [18:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:11] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [18:14:48] 06Operations, 06Analytics-Kanban: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#2982674 (10jcrespo) I recently packaged and puppetized [[ http://proxysql.com/ | ProxySQL ]]: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/proxysql/manifests/init.p... 
[18:15:42] (03PS2) 10Zhuyifei1999: kubernetesbackend: change absolute kubectl path to '/usr/bin/kubectl' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/334978 (https://phabricator.wikimedia.org/T156605) [18:18:56] !log nginx upgrade and wdqs restart complete - sorry for the noise [18:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:13] (03CR) 10Mobrovac: [C: 031] graphoid/gridengine/grub/haproxy/hhvm lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334319 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:28:46] PROBLEM - MariaDB Slave SQL: s1 on db1072 is CRITICAL: CRITICAL slave_sql_state could not connect [18:28:53] I thought I had downtimed that :( [18:29:09] I think that is depooled [18:29:12] it is [18:29:18] i see I only downtimed the lag [18:29:19] ok, so no worries :D [18:29:28] (03PS1) 10Andrew Bogott: novaproxy: Specify ssl_settings of [] if not using ssl. [puppet] - 10https://gerrit.wikimedia.org/r/335064 [18:29:36] PROBLEM - MariaDB Slave IO: s1 on db1072 is CRITICAL: CRITICAL slave_io_state could not connect [18:29:40] alerts based on etcd [18:29:46] will check if a server is pooled [18:29:53] and avoid criticals in that case [18:30:05] amirite? [18:30:09] marostegui: at least you are less verbose than I am :) [18:30:13] XDD [18:30:44] gehel, wait until a master goes down, and 20 servers page at the same time [18:31:22] jynus: yeah, but that never happens...
[18:31:26] ha [18:31:29] (03PS1) 10Yuvipanda: labs: Fix novaproxy not working when use_ssl is false [puppet] - 10https://gerrit.wikimedia.org/r/335065 [18:31:29] nooooo [18:31:31] so young [18:31:32] don't say that :( [18:31:36] so innocent [18:32:16] (03CR) 10Andrew Bogott: [C: 031] labs: Fix novaproxy not working when use_ssl is false [puppet] - 10https://gerrit.wikimedia.org/r/335065 (owner: 10Yuvipanda) [18:32:40] (03PS3) 10ArielGlenn: Move default config into a file [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [18:33:06] (03Abandoned) 10Andrew Bogott: novaproxy: Specify ssl_settings of [] if not using ssl. [puppet] - 10https://gerrit.wikimedia.org/r/335064 (owner: 10Andrew Bogott) [18:33:13] (03CR) 10Andrew Bogott: [C: 032] labs: Fix novaproxy not working when use_ssl is false [puppet] - 10https://gerrit.wikimedia.org/r/335065 (owner: 10Yuvipanda) [18:35:41] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.007 second response time [18:36:11] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [18:36:30] !log upload scap 3.5.0-1 - T127762 [18:36:32] thcipriani: ^ [18:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:35] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762 [18:36:44] godog: awesome :) [18:37:17] godog: scap version/config update: https://gerrit.wikimedia.org/r/#/c/334677/ [18:37:53] (03CR) 10Awight: "Hi! I just saw your comment from November... I agree, how about a xmldumps-backup/conf directory which would hold default config?" [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [18:38:13] !log update RESTBase to cd2b5e019: canary on restbase2001 [18:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:57] thcipriani: ok! going with that [18:39:47] godog: cool. Possible to force a puppet run on tin with that?
I'd like to test it there as soon as it's available. [18:39:56] (03PS2) 10Filippo Giunchedi: Scap: Bump version to 3.5.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/334677 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [18:40:38] thcipriani: yep, I've upgraded scap manually on tin in the meantime [18:40:55] ah, cool. Lemme give that a shot real quick. [18:41:22] (03CR) 10Filippo Giunchedi: [C: 032] Scap: Bump version to 3.5.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/334677 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [18:42:25] ok, running a test sync on tin of README [18:42:26] (03CR) 10ArielGlenn: "That's going to be the only file in there. Let's see what else is in here: doc, samples. Not loving it." [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [18:42:35] !log update RESTBase to cd2b5e019 [18:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:36] RECOVERY - MariaDB Slave IO: s1 on db1072 is OK: OK slave_io_state Slave_IO_Running: Yes [18:43:42] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982801 (10Papaul) @Marostegui server is now in A5. Just waiting for https://gerrit.wikimedia.org/r/#/c/335054/ to be merge. 
[18:43:56] RECOVERY - MariaDB Slave SQL: s1 on db1072 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:46:20] !log nuria@tin Started deploy [eventlogging/analytics@4b28b14]: (no justification provided) [18:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:24] !log nuria@tin Finished deploy [eventlogging/analytics@4b28b14]: (no justification provided) (duration: 00m 04s) [18:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:50] (03CR) 10Jcrespo: [C: 032] DNS: Change db2034 production dns.Server has been moved from row C to Row A Bug:T156478 [dns] - 10https://gerrit.wikimedia.org/r/335054 (owner: 10Papaul) [18:47:54] (03PS2) 10Jcrespo: DNS: Change db2034 production dns.Server has been moved from row C to Row A Bug:T156478 [dns] - 10https://gerrit.wikimedia.org/r/335054 (owner: 10Papaul) [18:48:24] !log mediawiki deployments momentarily [18:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:00] ^I assume you mean "blocking?" 
[18:50:15] !log rollback deployment to eventlogging [18:50:15] jynus: yes, sorry [18:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:28] you've accidentally mediawiki deployments [18:50:41] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-01 [18:53:10] !log nuria@tin Started deploy [eventlogging/analytics@4b28b14]: (no justification provided) [18:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:22] !log nuria@tin Finished deploy [eventlogging/analytics@4b28b14]: (no justification provided) (duration: 00m 11s) [18:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:38] !log unlocking mediawiki deployments for test [18:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:48] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T1900). Please do the needful. [19:00:29] SWAT will have to be on hold if there are patches, still testing new scap version [19:02:57] (03PS3) 10Dzahn: admin: fix log file perms for dc-ops on jessie [puppet] - 10https://gerrit.wikimedia.org/r/334719 (https://phabricator.wikimedia.org/T156529) [19:05:12] (03CR) 10Dzahn: [C: 032] admin: fix log file perms for dc-ops on jessie [puppet] - 10https://gerrit.wikimedia.org/r/334719 (https://phabricator.wikimedia.org/T156529) (owner: 10Dzahn) [19:07:30] thcipriani: No patches, you're clear [19:08:48] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [19:10:09] 06Operations, 10Graphite, 13Patch-For-Review: provide aggregated cluster data with graphite, similar to ganglia - https://phabricator.wikimedia.org/T119520#2982910 (10fgiunchedi) 05Open>03declined Declining, this functionality is now provided by Prometheus [19:10:13] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2982912 (10DarTar) a:05DarTar>03ellery [19:12:40] 06Operations, 07Puppet, 10Horizon, 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2757207 (10greg) UBN! for over a week? [19:15:54] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982951 (10jcrespo) Merged. Virtual console is busy (I assume by yourself), so I do not have visibility of the state of the server right now. [19:17:48] PROBLEM - MD RAID on relforge1001 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 2, Spare: 0 [19:17:48] 06Operations, 07Puppet, 10Horizon, 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2982953 (10Andrew) p:05Unbreak!>03Normal [19:17:49] ACKNOWLEDGEMENT - MD RAID on relforge1001 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T156663 [19:17:53] 06Operations, 10ops-eqiad: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#2982954 (10ops-monitoring-bot) [19:18:41] 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#2982960 (10RobH) a:05RobH>03mark I currently have a total of 5 spare pool systems in codfw. Of these 5, 3 of them may meet the specificat... 
[19:23:13] 06Operations, 10Monitoring: ganglia graphs should not have "N" as units - https://phabricator.wikimedia.org/T81659#2982979 (10fgiunchedi) 05Open>03declined We're replacing Ganglia with Prometheus [19:31:37] (03PS4) 10ArielGlenn: Move default config into a file [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [19:41:19] Urbanecm: did you get the block fix swatted ? [19:42:08] matanya: no swat at the moment [19:42:11] (03CR) 10ArielGlenn: "After an irc chat with awight, here's the compromise no one likes :-D" [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [19:42:16] fixing something with scap, sorry :( [19:44:33] thanks thcipriani [19:44:47] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Monitor Certificate Transparency (CT) logs - https://phabricator.wikimedia.org/T155807#2983086 (10faidon) This is the example output from just a few moments ago: ``` faidon@einsteinium:~$ certspotter 9a5646f8202e95c8df870ee1d36267fddd70ab5471060509b58f... [19:44:50] !log deploying latest wdqs gui [19:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:33] !log gehel@tin Started deploy [wdqs/wdqs@81442a0]: (no justification provided) [19:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:56] !log gehel@tin Finished deploy [wdqs/wdqs@81442a0]: (no justification provided) (duration: 01m 23s) [19:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:11] SMalyshev: ^ [19:52:12] Yay, love my new logging shame :p [20:00:04] tgr, dr0ptp4kt, and bblack: Dear anthropoid, the time has come. Please deploy Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T2000). 
[20:00:15] here [20:00:23] 06Operations: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#2983163 (10RobH) [20:00:26] 06Operations, 10hardware-requests: hardware request for netmon1001 - https://phabricator.wikimedia.org/T156040#2962228 (10RobH) 05Open>03stalled I've created a task (T156667) for the quotation for a replacement system. [20:01:45] bblack: OK to do the JsonConfig update now? [20:02:14] please wait, some scap debugging on, cc thcipriani [20:02:34] tgr: ^ [20:11:18] PROBLEM - puppet last run on wtp1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:12:01] !log update RESTBase to 501ea47edc in staging [20:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:06] FYI: Started an html dump on xenon [20:30:49] 06Operations, 13Patch-For-Review: fix log reading permissions for dc-ops admin group - https://phabricator.wikimedia.org/T156529#2983355 (10Dzahn) 05Open>03Resolved confirmed by papaul on install1001, and install2001 i see the same sudo rules, so should be resolved. [20:39:18] RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:40:19] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2983412 (10RobH) Please note that since this is no longer an active hardware request, I'm going to remove the #project so we don't get used... 
[20:40:23] dr0ptp4kt: I'll reschedule [20:40:51] tgr: thx [20:43:00] !log mobrovac@tin Started deploy [trending-edits/deploy@9addcd0]: Bump max_age to 18h for T156411 [20:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:05] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411 [20:44:26] (03CR) 10Yuvipanda: [C: 032] kubernetesbackend: change absolute kubectl path to '/usr/bin/kubectl' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/334978 (https://phabricator.wikimedia.org/T156605) (owner: 10Zhuyifei1999) [20:44:50] dapatrick: Can I get you to say "yes" on https://phabricator.wikimedia.org/T132063 please? [20:45:04] (03Merged) 10jenkins-bot: kubernetesbackend: change absolute kubectl path to '/usr/bin/kubectl' [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/334978 (https://phabricator.wikimedia.org/T156605) (owner: 10Zhuyifei1999) [20:45:19] Or, you know, "you're a terrible human being and this needs more work", whatever works [20:45:40] !log mobrovac@tin Finished deploy [trending-edits/deploy@9addcd0]: Bump max_age to 18h for T156411 (duration: 02m 39s) [20:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:03] (03PS3) 10Thcipriani: Scap: Bump version to 3.5.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/334677 (https://phabricator.wikimedia.org/T127762) [20:46:18] PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=6699): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa2da0aa890: Failed to establish a new connection: [Errno 111] Connection refused,)) [20:47:09] RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy [20:47:12] known ^ [20:50:23] !log thcipriani@tin Synchronized README: test scap 
(duration: 00m 43s) [20:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:38] !log uploaded scap 3.5.1-1 [20:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:57] ACKNOWLEDGEMENT - trendingedits endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=6699): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Marko Obrovac configuration issues, working on it [20:52:13] (03CR) 10Filippo Giunchedi: [C: 032] Scap: Bump version to 3.5.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/334677 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [20:54:48] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:54:59] marktraceur, Done. [20:55:20] dapatrick: Super, thanks a lot! [20:55:34] I think I still need to get a review for the extension, but at least this part is done... [20:55:47] !log mobrovac@tin Started deploy [trending-edits/deploy@5735f00]: Bump memory limit and heartbeat timeout [20:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:35] !log mobrovac@tin Finished deploy [trending-edits/deploy@5735f00]: Bump memory limit and heartbeat timeout (duration: 01m 48s) [20:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:22] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2983488 (10Tgr) [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T2100). 
[21:06:05] !log mobrovac@tin Started deploy [trending-edits/deploy@5735f00]: (no justification provided) [21:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:12] !log mobrovac@tin Finished deploy [trending-edits/deploy@5735f00]: (no justification provided) (duration: 03m 07s) [21:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:00] no mobileapps deploy today [21:10:01] !log mobrovac@tin Started deploy [trending-edits/deploy@5735f00]: (no justification provided) [21:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:14] !log mobrovac@tin Finished deploy [trending-edits/deploy@5735f00]: (no justification provided) (duration: 03m 13s) [21:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:22] 06Operations, 06Multimedia, 10Wikimedia-Site-requests, 07Performance: Choose a sensible set of thumbnail sizes for Special:Preferences - https://phabricator.wikimedia.org/T106640#2983536 (10Quiddity) There's a larger list of options (not including the one above) at https://www.mediawiki.org/wiki/Requests_f... 
[21:42:53] (03PS1) 10EBernhardson: [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 [21:43:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [21:47:35] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2983643 (10Tgr) [21:48:31] (03PS2) 10EBernhardson: [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 [21:49:47] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [21:51:48] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [21:51:48] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:54:48] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [21:54:48] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [21:54:48] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170130T2200). 
[22:01:09] (03PS3) 10EBernhardson: [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 [22:01:16] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 (owner: 10EBernhardson) [22:07:48] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:22:18] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 63776 MB (12% inode=99%) [22:22:48] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [22:30:51] thcipriani: would be nice if you can update https://wikitech.wikimedia.org/wiki/How_to_deploy_code [22:31:09] * thcipriani looks [22:32:48] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [22:33:14] matanya: will do, thanks for pointing it out. Most commands should still work the same, I'll get rid of references for scap sync-dir though. [22:33:37] yeah, that was my main point, thanks for the release [22:33:48] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:34:48] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:39:18] RECOVERY - Disk space on elastic1019 is OK: DISK OK [22:55:48] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [22:58:05] thcipriani: Relatedly... 
https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS seems like a completely useless section [22:58:15] 06Operations, 10Wikimedia-Stream: Error on RCSteam server startup for the "flash policy server" - https://phabricator.wikimedia.org/T153770#2983908 (10Krinkle) [22:58:49] I've pointed folks at that section in the past [22:59:35] People think we have to re-minify CSS/JS by hand somewhere? [22:59:39] Ok...nvm then.... [23:00:25] matanya: For what it's worth, `scap sync-dir` still works, it's just a back-compat alias and is hidden from scap's help [23:00:33] (hidden option) [23:00:46] But yes, doc improvements to remove references are good :) [23:00:50] 06Operations, 10Wikimedia-Stream: rcstream service - gevent dependency incompatibility - https://phabricator.wikimedia.org/T153773#2890586 (10Krinkle) [23:01:15] ostriches: i am just nagging around as usual ;) [23:01:33] Oh no worries, I'm just adding noise/background :) [23:01:48] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [23:01:55] we can build a band [23:03:46] ostriches: thcipriani: 2011 ;) https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code&diff=48117&oldid=48116 [23:04:21] Yes, I remember $wgStyleVersion :) [23:04:47] "there is no need to e.g manually do a "build" (to re-minify/re-cache static files)" -- style version didn't really do that :) [23:05:01] Eh, cache busted, I suppose [23:05:05] But minify feels wrong there [23:05:08] Oh well, I'm nitpicking [23:05:31] Also: I miss pre-RL days :( [23:06:15] ostriches: since this was edited by neilk, i have extra context for you: UploadWizard use to have manually minified JS and CSS code. [23:06:29] (neilk worked on it) [23:06:58] Silly uploadwizard [23:07:01] :D [23:08:27] (03CR) 10Krinkle: "+2 for stream.wm.o re-use. However this means it'll have to route after DNS/LVS, at the Varnish level. 
Since there appears to be a separat" [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [23:13:58] (03CR) 10Krinkle: "(meant +1 :))" [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [23:34:18] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:59] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2984081 (10RobH)