[00:01:29] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[00:29:29] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[00:56:09] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:09:29] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[01:24:09] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[01:37:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[01:38:07] (CR) Tim Landscheidt: [C: 1] puppetmaster module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[01:38:19] (CR) Tim Landscheidt: [C: 1] toollabs role modules: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332110 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[01:38:31] (CR) Tim Landscheidt: [C: 1] toollabs module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332111 (owner: Juniorsys)
[01:59:29] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.35, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection
object at 0x7f90e6f74950: Failed to establish a new connection: [Errno 111] Connection refused,))
[02:00:09] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:00:09] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[02:00:09] PROBLEM - cassandra service on restbase-dev1002 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running
[02:00:09] PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:00:09] PROBLEM - cassandra SSL 10.64.32.112:7001 on restbase-dev1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:00:19] PROBLEM - cassandra CQL 10.64.32.112:9042 on restbase-dev1002 is CRITICAL: connect to address 10.64.32.112 and port 9042: Connection refused
[02:00:19] PROBLEM - Restbase root url on restbase-dev1002 is CRITICAL: connect to address 10.64.32.112 and port 7231: Connection refused
[02:00:19] PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused
[02:20:09] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[02:21:27] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 08m 04s)
[02:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 16 02:25:46 UTC 2017 (duration 4m 21s)
[02:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:09] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[03:04:09] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational
[03:04:09] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active
[03:07:09] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:07:09] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[03:21:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 721.01 seconds
[03:24:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 277.05 seconds
[03:34:09] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:03:09] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[05:06:09] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:07:55] Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2942156 (Shoichi) >>! In T148693#2936504, @Arthur2e5 wrote: > Yes.
> > * * * > > 2017-01-14, too lazy to add a comment: > > Well, that's just an exa...
[05:35:09] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[05:39:29] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Package[tzdata],Service[zotero],Exec[zotero-admin_ensure_members],Exec[sc-admins_ensure_members]
[05:49:09] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:50:09] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:09] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active
[06:07:09] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:07:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:17:09] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:18:09] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:31:09] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:34:10] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:02:10] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:02:29] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 21 failures.
Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[07:23:46] (CR) Reedy: Reinstate "Remove MWVersion, fold its two functions into MWMultiVersion" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/331552 (owner: Reedy)
[07:29:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:29:29] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:34:09] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational
[07:34:09] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active
[07:37:09] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:37:09] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[07:37:09] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:45:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:49:09] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:52:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:55:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:57:21] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2003 - https://phabricator.wikimedia.org/T155363#2942271 (Volans)
[07:58:30] ACKNOWLEDGEMENT - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdj1] Volans Broken disk: https://phabricator.wikimedia.org/T155363
[08:00:09] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[08:00:52] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2003 - https://phabricator.wikimedia.org/T155363#2941454 (Volans) FYI also Puppet is broken (see [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-be2003&service=puppet+last+run | Icinga ]]) with: ``` Notice: /Stage[...
[08:06:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:08:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:09:29] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:10:43] (CR) Muehlenhoff: "Does it strictly need 5.1.40? Otherwise let's rather use the packaged 5.1.39 from Debian. mysql-connector-java had a security issue last y" [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[08:13:04] Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#2942281 (MoritzMuehlenhoff)
[08:13:30] Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#2942293 (MoritzMuehlenhoff) p:Triage>Normal a:MoritzMuehlenhoff
[08:14:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:19:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:22:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:37:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:38:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[08:41:09] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[08:49:09] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[08:51:19] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2942314 (hashar)
[08:51:34] (PS1) Marostegui: db-codfw.php: Repool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/332218 (https://phabricator.wikimedia.org/T149553)
[08:53:08] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/332218 (https://phabricator.wikimedia.org/T149553) (owner: Marostegui)
[08:54:47] (Merged) jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/332218 (https://phabricator.wikimedia.org/T149553) (owner: Marostegui)
[08:55:02] (CR) jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/332218 (https://phabricator.wikimedia.org/T149553) (owner: Marostegui)
[08:56:16] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2034 - T149553 (duration: 00m 38s)
[08:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:20] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553
[08:57:18] Operations, ops-codfw, DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2942328 (Marostegui)
[08:57:45] Operations, ops-codfw, DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2778754 (Marostegui) Open>Resolved a:Marostegui Closing this ticket as the server has been repooled. Looks like the cause was the CPUs.
[09:01:46] (CR) Alexandros Kosiaris: [C: -2] "Aside from the technical side which is that this should be using the sudo module's defines, I would like to know why the gerrit user shoul" [puppet] - https://gerrit.wikimedia.org/r/331998 (owner: Paladox)
[09:04:18] Hey akosiaris o/
[09:04:55] marostegui: o/
[09:12:09] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[09:13:17] !log Compressing dewiki on db1026 - T154929
[09:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:21] T154929: db1026 (s5) needs some compression - https://phabricator.wikimedia.org/T154929
[09:18:09] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:19:44] (PS4) Muehlenhoff: Switch swift in esams to systemd-timesyncd [puppet] - https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257)
[09:20:47] (CR) Hashar: [C: 1] contint module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332096 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[09:22:06] (CR) Muehlenhoff: [C: 2] Switch swift in esams to systemd-timesyncd [puppet] - https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257) (owner: Muehlenhoff)
[09:34:25] (CR) Muehlenhoff: [C: 1] "Seems fine" [debs/gerrit] - https://gerrit.wikimedia.org/r/331873 (owner: Paladox)
[09:43:04] (PS2) Juniorsys: contint module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332096 (https://phabricator.wikimedia.org/T93645)
[09:45:45] (PS2) Juniorsys: toollabs module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332111
[09:46:41] (PS2) Juniorsys: toollabs role modules: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332110 (https://phabricator.wikimedia.org/T93645)
[09:47:27] (PS2) Juniorsys: puppetmaster module: Linting changes [puppet] -
https://gerrit.wikimedia.org/r/332105 (https://phabricator.wikimedia.org/T93645)
[09:48:11] (PS3) Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645)
[09:48:57] (PS2) Juniorsys: varnish module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332113 (https://phabricator.wikimedia.org/T93645)
[09:50:36] (PS2) Juniorsys: torrus module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332112 (https://phabricator.wikimedia.org/T93645)
[09:51:21] (PS2) Juniorsys: statistics module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332109 (https://phabricator.wikimedia.org/T93645)
[09:52:01] (PS2) Juniorsys: snapshot module: Use full names for class names [puppet] - https://gerrit.wikimedia.org/r/332108 (https://phabricator.wikimedia.org/T93645)
[09:52:43] Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#2942419 (MoritzMuehlenhoff)
[09:52:46] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2942417 (MoritzMuehlenhoff) Open>Resolved All done
[09:53:42] (PS2) Juniorsys: site.pp - Use full class names, not relative ones [puppet] - https://gerrit.wikimedia.org/r/332107 (https://phabricator.wikimedia.org/T93645)
[09:54:38] (PS2) Juniorsys: role analytics_cluster: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332106 (https://phabricator.wikimedia.org/T93645)
[09:56:06] (PS2) Juniorsys: postgresql module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332104 (https://phabricator.wikimedia.org/T93645)
[09:56:45] (PS2) Juniorsys: mediawiki module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645)
[09:57:55] (PS2) Juniorsys: install_server module: Linting changes [puppet] -
https://gerrit.wikimedia.org/r/332102
[09:58:39] (PS2) Juniorsys: ganglia module: Use full names for class names [puppet] - https://gerrit.wikimedia.org/r/332100 (https://phabricator.wikimedia.org/T93645)
[09:59:39] (PS2) Juniorsys: druid module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332099 (https://phabricator.wikimedia.org/T93645)
[10:00:24] (PS2) Juniorsys: diamond module: Add trailing commas [puppet] - https://gerrit.wikimedia.org/r/332098 (https://phabricator.wikimedia.org/T93645)
[10:01:12] (PS2) Juniorsys: dataset module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332097 (https://phabricator.wikimedia.org/T93645)
[10:02:03] (PS2) Juniorsys: conftool module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332095 (https://phabricator.wikimedia.org/T93645)
[10:02:48] (PS2) Juniorsys: authdns: Add trailing comma [puppet] - https://gerrit.wikimedia.org/r/332093 (https://phabricator.wikimedia.org/T93645)
[10:03:14] <_joe_> uh, talk bout a massive rebase :P
[10:03:32] :)
[10:03:40] (PS2) Juniorsys: bacula module: Trailing commas, full class names [puppet] - https://gerrit.wikimedia.org/r/332094 (https://phabricator.wikimedia.org/T93645)
[10:18:03] (PS1) Muehlenhoff: Update to 4.4.42 [debs/linux44] - https://gerrit.wikimedia.org/r/332221
[10:28:30] (CR) Muehlenhoff: [C: 2] Update to 4.4.42 [debs/linux44] - https://gerrit.wikimedia.org/r/332221 (owner: Muehlenhoff)
[10:30:38] !log Compressing pagelinks tables on db1038 - T154465
[10:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:43] T154465: Defragment db1038 - https://phabricator.wikimedia.org/T154465
[10:34:04] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational
[10:34:04] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active
[10:35:37] !log Compressing templatelinks tables on db1044
(depooled) - T153826
[10:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:41] T153826: Defragment db1044 - https://phabricator.wikimedia.org/T153826
[10:40:01] (PS1) Muehlenhoff: Update to 4.4.43 [debs/linux44] - https://gerrit.wikimedia.org/r/332222
[11:04:43] (CR) Muehlenhoff: [C: 2] Update to 4.4.43 [debs/linux44] - https://gerrit.wikimedia.org/r/332222 (owner: Muehlenhoff)
[11:05:04] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[11:17:31] Operations, ops-eqiad: Degraded RAID on ms1001 - https://phabricator.wikimedia.org/T152367#2942505 (MoritzMuehlenhoff) p:Triage>Normal a:Cmjohnson
[11:19:36] Operations, ops-eqiad, netops: asw2-d-eqiad.mgmt.eqiad - JNX_ALARMS CRITICAL - 2 red alarms, - https://phabricator.wikimedia.org/T152182#2942507 (MoritzMuehlenhoff) a:Cmjohnson
[11:21:20] Operations, Release-Engineering-Team, Wikimedia-Logstash, Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782#2942508 (MoritzMuehlenhoff) p:Triage>Normal
[11:22:20] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2003 - https://phabricator.wikimedia.org/T155363#2942509 (MoritzMuehlenhoff) a:Papaul
[11:24:08] Operations, Ops-Access-Requests: Request to access hadoop (stat1004) for Ladsgroup - https://phabricator.wikimedia.org/T155303#2942512 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:24:18] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to hive/webrequest data for demon - https://phabricator.wikimedia.org/T155198#2942513 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:24:27] Operations, Ops-Access-Requests: Requesting
access to analytics-privatedata-users for anomie - https://phabricator.wikimedia.org/T155143#2942514 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:28:03] access to Hadoop everywhere :D
[11:30:01] (PS1) Marostegui: WIP: Split dbstore role [puppet] - https://gerrit.wikimedia.org/r/332228
[11:31:54] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[11:33:04] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[11:34:20] (CR) Marostegui: "What about this: https://gerrit.wikimedia.org/r/#/c/332228/ this might be a good temporary (or maybe not temporary) solution" [puppet] - https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) (owner: Jcrespo)
[11:34:41] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/5092/" [puppet] - https://gerrit.wikimedia.org/r/332228 (owner: Marostegui)
[11:47:10] (CR) Alexandros Kosiaris: "Yes and no. No short term, yes long term. I 'll abandon for now and revive when/if required" [puppet] - https://gerrit.wikimedia.org/r/259008 (owner: Alexandros Kosiaris)
[11:47:22] (Abandoned) Alexandros Kosiaris: Add shinken module/roles [puppet] - https://gerrit.wikimedia.org/r/259008 (owner: Alexandros Kosiaris)
[11:59:54] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[12:04:00] Operations, media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (fgiunchedi) Does this happen on every server side upload? On what files and times?
I'm asking to better track down the error in MW logs
[12:26:31] !log installing pdns-recursor security updates
[12:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:16] Operations, Traffic, Wikidata, HTTPS: wikiba.se should use HTTPS - https://phabricator.wikimedia.org/T155359#2941299 (Esc3300) T153563 is still open as well.
[12:34:14] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:03:14] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[13:03:25] Operations: Reimage achernar and amacar to jessie - https://phabricator.wikimedia.org/T155411#2942665 (MoritzMuehlenhoff)
[13:12:53] !log installing pysaml2 security updates
[13:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:05] (CR) Muehlenhoff: [C: 1] "Looks fine, I'll merge this on Wednesday unless there are any further objections." [puppet] - https://gerrit.wikimedia.org/r/331925 (https://phabricator.wikimedia.org/T155198) (owner: Chad)
[13:38:53] Operations, Monitoring, Traffic, Wikimedia-Incident: Plot number of cached objects on a per-server per-DC basis - https://phabricator.wikimedia.org/T154864#2942730 (fgiunchedi) The number of objects in varnish for frontend/backend is now also available at https://grafana.wikimedia.org/dashboard/...
[13:41:55] jouncebot: next
[13:42:21] jouncebot: update
[13:42:24] jouncebot: refresh
[13:42:26] I refreshed my knowledge about deployments.
[13:42:39] jouncebot: next
[13:42:50] zeljkof: so jouncebot is around
[13:43:02] but empty since there is no calendar yet on the wiki
[13:44:23] (PS1) Muehlenhoff: Add anomie (Brad Jorsch) to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/332323 (https://phabricator.wikimedia.org/T155143)
[13:45:38] hashar, just a question.
Will I be able to schedule a few patches for deployment in 15 minutes? There isn't a calendar on the wiki so I don't know...
[13:46:33] I need to deploy patches for T155345, T155278, T155309 and T155301.
[13:46:33] T155309: Please add www.leventhalmap.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T155309
[13:46:34] T155301: Set wgBabelMainCategory for cswikiversity to Uživatel %code% - https://phabricator.wikimedia.org/T155301
[13:46:34] T155345: Request for a temporary lift of account creation cap on IP - https://phabricator.wikimedia.org/T155345
[13:46:34] T155278: Namespace aliases on Bhojpuri Wikipedia (bhwiki) - https://phabricator.wikimedia.org/T155278
[13:47:47] (Abandoned) Urbanecm: Add ftpmirror.your.org to whitelist of commons [mediawiki-config] - https://gerrit.wikimedia.org/r/328036 (https://phabricator.wikimedia.org/T153569) (owner: Urbanecm)
[13:52:19] (CR) Muehlenhoff: [C: 2] Add anomie (Brad Jorsch) to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/332323 (https://phabricator.wikimedia.org/T155143) (owner: Muehlenhoff)
[14:01:54] Urbanecm: I will do them
[14:01:59] Okay. Thanks
[14:02:11] BTW why isn't the SWAT in the calendar?
[14:04:22] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for anomie - https://phabricator.wikimedia.org/T155143#2942779 (MoritzMuehlenhoff) Open>Resolved @Anomie I've added you to the group, you should now be able to log into stat1004.eqiad.wmnet. Pl...
[14:05:20] Operations, Availability, User-Elukey, Wikimedia-Incident: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#2942781 (elukey)
[14:06:41] Urbanecm: doing them now
[14:07:14] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[14:08:04] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[14:08:58] Hi.
[14:09:04] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2934619 keys, up 77 days 5 hours - replication_delay is 0
[14:11:15] (CR) Hashar: [C: -1] "License is unclear on upstream site and for a few imports I would prefer we do not use $wgCopyUpload. It is meant for mass imports really" [mediawiki-config] - https://gerrit.wikimedia.org/r/332053 (https://phabricator.wikimedia.org/T155309) (owner: Urbanecm)
[14:11:57] (PS2) Hashar: Set wgBabelMainCategory for cswikiversity to Uživatel %code% [mediawiki-config] - https://gerrit.wikimedia.org/r/332046 (https://phabricator.wikimedia.org/T155301) (owner: Urbanecm)
[14:12:04] hashar, okay, I'll remember it for the copyUpload.
[14:12:19] Urbanecm: there was some discussion about clearing out the entries there
[14:12:25] seems most were for one shot entries
[14:12:39] but the real concern is the license of materials on their site, it is unclear
[14:12:48] Okay.
[14:12:58] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/332046 (https://phabricator.wikimedia.org/T155301) (owner: Urbanecm)
[14:13:30] if in public domain, at best could be a kind request to credit curators, at worst could be a bogus public domain appropriation claim
[14:14:09] according to http://www.leventhalmap.org/explore/subject they have a lot < 1923 content (so in PD in US)
[14:14:13] (Merged) jenkins-bot: Set wgBabelMainCategory for cswikiversity to Uživatel %code% [mediawiki-config] - https://gerrit.wikimedia.org/r/332046 (https://phabricator.wikimedia.org/T155301) (owner: Urbanecm)
[14:14:28] (CR) jenkins-bot: Set wgBabelMainCategory for cswikiversity to Uživatel %code% [mediawiki-config] - https://gerrit.wikimedia.org/r/332046 (https://phabricator.wikimedia.org/T155301) (owner: Urbanecm)
[14:14:30] The request is probably legit, but original requester could indeed clarify a little bit.
[14:14:54] Urbanecm: "Set wgBabelMainCategory for cswikiversity to Uživatel %code%" is on mwdebug1001 now
[14:15:03] hashar, okay
[14:15:41] (PS2) Hashar: Add one throttle rule + remove obsolete ones [mediawiki-config] - https://gerrit.wikimedia.org/r/332134 (https://phabricator.wikimedia.org/T155345) (owner: Urbanecm)
[14:15:50] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/332134 (https://phabricator.wikimedia.org/T155345) (owner: Urbanecm)
[14:17:24] hashar, works
[14:17:29] (Merged) jenkins-bot: Add one throttle rule + remove obsolete ones [mediawiki-config] - https://gerrit.wikimedia.org/r/332134 (https://phabricator.wikimedia.org/T155345) (owner: Urbanecm)
[14:17:47] \O/
[14:17:55] (CR) jenkins-bot: Add one throttle rule + remove obsolete ones [mediawiki-config] - https://gerrit.wikimedia.org/r/332134 (https://phabricator.wikimedia.org/T155345) (owner: Urbanecm)
[14:18:24] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php:
wgBabelMainCategory for cswikiversity to Uživatel %code% T155301 (duration: 00m 39s)
[14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:29] T155301: Set wgBabelMainCategory for cswikiversity to Uživatel %code% - https://phabricator.wikimedia.org/T155301
[14:19:58] (PS2) Hashar: Namespace aliases on Bhojpuri Wikipedia (bhwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/332015 (https://phabricator.wikimedia.org/T155278) (owner: Urbanecm)
[14:20:02] !log hashar@tin Synchronized wmf-config/throttle.php: Add one throttle rule + remove obsolete ones T155345 (duration: 00m 38s)
[14:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:06] T155345: Request for a temporary lift of account creation cap on IP - https://phabricator.wikimedia.org/T155345
[14:21:26] godog: can't repro, I made a test, it works today. Yes, Friday, around 15h UTC, everyone failed.
[14:21:43] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/332015 (https://phabricator.wikimedia.org/T155278) (owner: Urbanecm)
[14:21:50] godog: I'll comment on the task after some more files
[14:23:11] (Merged) jenkins-bot: Namespace aliases on Bhojpuri Wikipedia (bhwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/332015 (https://phabricator.wikimedia.org/T155278) (owner: Urbanecm)
[14:23:15] Dereckson: ok thanks!
[14:23:22] (03CR) 10jenkins-bot: Namespace aliases on Bhojpuri Wikipedia (bhwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332015 (https://phabricator.wikimedia.org/T155278) (owner: 10Urbanecm) [14:23:39] Urbanecm: Namespace aliases on Bhojpuri Wikipedia (bhwiki) [14:23:41] [14:23:42] is on mwdebug1001 [14:24:59] no clue how to test that though [14:26:20] Urbanecm: I am syncing the bhwiki ns alias [14:26:49] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Namespace aliases on Bhojpuri Wikipedia (bhwiki) - T155278 (duration: 00m 41s) [14:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:54] T155278: Namespace aliases on Bhojpuri Wikipedia (bhwiki) - https://phabricator.wikimedia.org/T155278 [14:28:35] hashar, by accessing some known page in that namespace? [14:28:36] (i.e. try to access the page with U: and User: [14:28:37] ) [14:28:41] I'm working on that but I have really slow connection... [14:30:52] Okay, thanks. [14:33:48] I am trying it [14:33:52] Urbanecm: Wp:How_to_create_a_page is broken now :( [14:33:59] redirects to https://bh.wikipedia.org/wiki/विकिपीडिया:How_to_create_a_page [14:34:06] (03PS1) 10Urbanecm: Add a new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332324 (https://phabricator.wikimedia.org/T155416) [14:34:31] Wp with this case is expected to work? [14:35:06] Why it is broken? From my POV it works as it should... [14:35:14] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:35:15] I don't know [14:35:18] namespace dupes reports: [14:35:28] pagelinks from=37242 ns=0 dbk=Wp:How_to_create_a_page -> विकिपीडिया:How_to_create_a_page [14:35:36] I meant what "broken" means in this... 
[14:35:49] (03PS2) 10Urbanecm: Add a new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332324 (https://phabricator.wikimedia.org/T155416) [14:36:16] You have to visit the page 37242 and fix the link manually [14:36:20] ah no [14:36:39] You = a contributor of bh. [14:37:09] or we could alias Wp / WP ? [14:37:39] I think the search box should work with WP and Wp... [14:38:00] BTW hashar can you deploy 332324 too? [14:38:05] (03CR) 10Jcrespo: [C: 04-1] "@marostegui ok with the change, just overwrite this one with that one, or abandon this (worst case scenario, we just revert). The other th" [puppet] - 10https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) (owner: 10Jcrespo) [14:39:27] Urbanecm: white spaces!!! https://gerrit.wikimedia.org/r/#/c/332324/2/wmf-config/throttle.php ;-} [14:39:35] Urbanecm: will handle it [14:40:21] The "will handle it" mean you will remove it? [14:40:25] (03PS3) 10Hashar: Add a new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332324 (https://phabricator.wikimedia.org/T155416) (owner: 10Urbanecm) [14:40:25] Or should I? [14:40:42] Now I know. Thanks hsh [14:40:44] hashar, [14:40:52] (03CR) 10Hashar: [C: 032] "Fixed up a trailing whitespace. SWAT!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/332324 (https://phabricator.wikimedia.org/T155416) (owner: 10Urbanecm) [14:42:26] (03Merged) 10jenkins-bot: Add a new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332324 (https://phabricator.wikimedia.org/T155416) (owner: 10Urbanecm) [14:42:36] (03CR) 10jenkins-bot: Add a new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332324 (https://phabricator.wikimedia.org/T155416) (owner: 10Urbanecm) [14:44:41] Urbanecm: it is being deployed [14:44:44] !log hashar@tin Synchronized wmf-config/throttle.php: Add a new throttle rule - T155416 (duration: 00m 38s) [14:44:45] Thanks [14:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:48] T155416: Request for a temporary lift of account creation cap on IP - https://phabricator.wikimedia.org/T155416 [14:45:00] (03PS1) 10Urbanecm: Change the NS_PROJECT name to "Википедия" on avwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332325 (https://phabricator.wikimedia.org/T155321) [14:46:30] Just for clarify. Are all of my patches I wanted to deploy deployed? [14:46:34] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:44] Urbanecm: yup :) [14:47:49] !log European SWAT complete [14:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:58] Urbanecm: I am looking at the bhwiki madness a bit [14:48:04] and will craft a report on the task [14:48:09] hashar, thanks. [14:48:13] id=28199 ns=0 dbk=WP:VP *** dest title exists and --add-prefix not specified [14:48:13] id=35228 ns=0 dbk=वि:हटावल *** dest title exists and --add-prefix not specified [14:48:14] conflicts! 
[14:52:45] (03CR) 10Marostegui: "> @marostegui ok with the change, just overwrite this one with that" [puppet] - 10https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) (owner: 10Jcrespo) [14:53:44] (03CR) 10Marostegui: "Coming from: https://gerrit.wikimedia.org/r/#/c/328671/1 - Jaime suggest to also split dbstore original role from mariadb.pp which I agree" [puppet] - 10https://gerrit.wikimedia.org/r/332228 (owner: 10Marostegui) [14:56:39] (03PS1) 10Urbanecm: Remove the botadmin group from mlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332329 (https://phabricator.wikimedia.org/T152296) [14:58:12] checking mw1167.. [15:01:31] !log restarting hhvm on mw1167 - hhvm-dump-debug in /tmp/hhvm.20360.bt [15:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:24] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [15:05:21] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#2943074 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:07:19] hashar: thanks for jouncebot [15:19:53] 06Operations, 10media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#2940368 (10Dereckson) Can't repro this Monday. I'm currently uploading some files and will report later in the day the status, but currently, all works like a charm. [15:35:04] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [15:52:14] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:54:04] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:04] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:21:14] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:22:04] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:23:57] (03PS14) 10Paladox: Gerrit: Add support for logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [16:27:46] (03PS6) 10Paladox: Gerrit: Enable config localUsernameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) [16:32:15] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2943204 (10jcrespo) As promised, here it is the incident report. https://wikitech.wikimedia... [17:04:06] godog: omg awesome, for the grafana "objects in cache" graphs [17:05:58] apergos: inorite! [17:06:06] wasn't too hard to put together too [17:06:26] hard or no, it wasn't ther and now it is, so thank you! [17:10:26] yeah that's great :) [17:11:41] \o/ thanks :)) [17:19:54] 06Operations, 10ops-eqiad, 10hardware-requests: Reclaim/Decommission (specify) stat1001 - https://phabricator.wikimedia.org/T154164#2943245 (10Dzahn) [17:20:14] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:24:34] 06Operations, 10ops-eqiad, 10hardware-requests: Reclaim/Decommission (specify) stat1001 - https://phabricator.wikimedia.org/T154164#2902803 (10Dzahn) still in DHCP and DNS [17:49:14] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:54:22] (03PS2) 10ArielGlenn: dataset: fix Unrecognized escape sequence '\?' [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [18:01:41] (03CR) 10ArielGlenn: "Heh, so my advice was both free and wrong. This is for dataset1001 in the end (hence in the dataset module). The puppet compiler says it'" [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [18:06:13] (03CR) 10ArielGlenn: [C: 032] dataset: fix Unrecognized escape sequence '\?' [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [18:09:34] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 45, down: 0, shutdown: 0 [18:14:03] (03PS3) 10ArielGlenn: snapshot module: Use full names for class names [puppet] - 10https://gerrit.wikimedia.org/r/332108 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:14:25] (03PS1) 10Giuseppe Lavagetto: base: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/332355 [18:19:19] (03PS2) 10Giuseppe Lavagetto: base: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/332355 [18:20:14] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:22:02] (03CR) 10ArielGlenn: [C: 032] snapshot module: Use full names for class names [puppet] - 10https://gerrit.wikimedia.org/r/332108 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:24:44] (03PS3) 10Giuseppe Lavagetto: base: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/332355 [18:25:43] (03PS3) 10ArielGlenn: dataset module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332097 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:26:25] huh, https://wikitech.wikimedia.org/wiki/Deployments is missing this week? [18:27:00] greg-g: moritzm: are we doing the train deployments as usual? is there a SWAT in half an hour? [18:30:18] us holiday so I dunno about that [18:30:25] there should be no greg [18:30:58] jouncebot next [18:31:06] jouncebot: next [18:31:14] jouncebot: next [18:31:23] well, there's nothing next [18:31:24] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [18:31:27] we should be glad it's not crashing :D [18:31:33] heh [18:32:17] half of ops is here and jetlagged, the other half is travelling or sleeping or on holiday, probably the same true for releng etc [18:33:22] i didn't realize it's a holiday. okay, i'll ask again tomorrow. it's not so pressing. ;) [18:34:21] (03CR) 10Dereckson: [C: 031] "Uploader is a trusted user on Commons and clarified request scope. The goal is to only import < 1923 pictures in public domain in US, but " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332053 (https://phabricator.wikimedia.org/T155309) (owner: 10Urbanecm) [18:35:40] ok! 
[18:42:05] !log uploaded nodejs 6.9.1 for jessie-wikimedia to carbon [18:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:38] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2943384 (10MoritzMuehlenhoff) @mobrovac nodejs 6.9.1 has been uploaded to carbon. @Gehel Note that Karthoterian isn't ready for node 6 yet, so we need to be careful to not... [18:48:16] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:00:24] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:02:41] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Update ci to nodejs 6 - https://phabricator.wikimedia.org/T155443#2943419 (10Paladox) [19:02:53] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Update ci to nodejs 6 - https://phabricator.wikimedia.org/T155443#2943431 (10Paladox) p:05Triage>03High [19:03:41] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Update ci to nodejs 6 - https://phabricator.wikimedia.org/T155443#2943419 (10Paladox) As this can break ci at any time because someone installs a new instance or runs apt-get update and then... 
[19:13:41] (03CR) 10ArielGlenn: [C: 032] dataset module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332097 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:16:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 642 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2951765 keys, up 77 days 10 hours - replication_delay is 642 [19:16:55] (03PS2) 10ArielGlenn: rsync for Erik Zachte from stat* hosts to dataset1001 other/media [puppet] - 10https://gerrit.wikimedia.org/r/331924 [19:29:24] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2940717 keys, up 77 days 11 hours - replication_delay is 0 [19:33:14] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:36:49] (03PS1) 10Hashar: contint: import rewrite rule from integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/332385 (https://phabricator.wikimedia.org/T150727) [19:39:29] (03CR) 10Hashar: "I guess that reverts https://gerrit.wikimedia.org/r/#/c/322201/3/modules/contint/templates/apache/doc.wikimedia.org.erb" [puppet] - 10https://gerrit.wikimedia.org/r/332385 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [19:39:55] (03CR) 10Hashar: "I tested that one locally and that looks fine?" [puppet] - 10https://gerrit.wikimedia.org/r/332385 (https://phabricator.wikimedia.org/T150727) (owner: 10Hashar) [19:49:30] (03CR) 10Hashar: "Neat! Happy it got deployed all fine and with no harm." [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [19:53:38] (03CR) 10Hashar: [C: 031] "-1 -> +1 after Dereckson/Pi clarified the intent on T155309. So all good to me now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/332053 (https://phabricator.wikimedia.org/T155309) (owner: 10Urbanecm) [19:59:59] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Update ci to nodejs 6 - https://phabricator.wikimedia.org/T155443#2943539 (10hashar) [20:00:04] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2943538 (10hashar) [20:01:01] (03PS4) 10Giuseppe Lavagetto: base: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/332355 [20:02:14] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:02:50] (03CR) 10ArielGlenn: [C: 032] rsync for Erik Zachte from stat* hosts to dataset1001 other/media [puppet] - 10https://gerrit.wikimedia.org/r/331924 (owner: 10ArielGlenn) [20:04:14] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:14:39] (03PS1) 10ArielGlenn: turn on ezachte rsync to other/media on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/332390 [20:24:14] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:31] 06Operations, 10Ops-Access-Requests: Request to access hadoop (stat1004) for Ladsgroup - https://phabricator.wikimedia.org/T155303#2943556 (10Nuria) Approved pending nda check. [20:28:13] (03PS5) 10Giuseppe Lavagetto: base: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/332355 [20:32:20] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2943563 (10hashar) > nodejs 6.9.1 has been uploaded to carbon. I have learned about that upgrade literally a minute ago via T155443. That is a breaking change for CI! On... 
[20:33:14] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:44:13] (03CR) 10ArielGlenn: [C: 032] turn on ezachte rsync to other/media on dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/332390 (owner: 10ArielGlenn) [20:46:25] 06Operations, 10Continuous-Integration-Infrastructure: (Nodepool) CI is really slow tonight - https://phabricator.wikimedia.org/T155444#2943438 (10hashar) [20:52:14] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:54:15] 06Operations, 10Continuous-Integration-Infrastructure: (Nodepool) CI is really slow tonight - https://phabricator.wikimedia.org/T155444#2943615 (10hashar) One other possibility is we had a lot of changes merged for oojs/ui. On merge that runs `oojs-ui-coverage` but only one copy of the job can run on the infr... [20:56:36] (03CR) 10ArielGlenn: "It's bitrotten. It will be a good while before I can get back to it." [software] - 10https://gerrit.wikimedia.org/r/233478 (owner: 10ArielGlenn) [21:08:57] 06Operations, 10Ops-Access-Requests: Requesting to change the production public key - https://phabricator.wikimedia.org/T155449#2943623 (10Pchelolo) [21:09:57] 06Operations, 10Traffic: convert dumps to use Letsencrypt for SSL cert (deadline: 2017-04-26) - https://phabricator.wikimedia.org/T154940#2943636 (10ArielGlenn) The first monthly dumps run is complete. Do the conversion in the next couple days and you are golden! [21:10:49] 06Operations, 10Continuous-Integration-Infrastructure: (Nodepool) CI is really slow tonight - https://phabricator.wikimedia.org/T155444#2943438 (10matmarex) I've been submitting a lot of OOjs UI changes today, and James has been merging a lot of them. Sorry if we overwhelmed the CI. 
:D [21:12:32] 06Operations, 10Continuous-Integration-Infrastructure: (Nodepool) CI is really slow tonight - https://phabricator.wikimedia.org/T155444#2943651 (10matmarex) (There were 24 changesets submitted in the last 3 hours, some with multiple patchsets. https://gerrit.wikimedia.org/r/#/q/project:oojs/ui) [21:38:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2944269 keys, up 77 days 13 hours - replication_delay is 619 [21:39:24] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2940121 keys, up 77 days 13 hours - replication_delay is 0 [22:27:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [22:28:24] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2941273 keys, up 77 days 14 hours - replication_delay is 0 [23:16:07] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, 15User-Addshore: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#2943898 (10Bawolff)